« Database Optimize patterns | Main | Distributed content system with bandwidth balancing »
Monday
May252009

non-sequential, unique identifier, strategy question

(Please bare with me, I'm a new, passionate, confident and terrified programmer :D )

Background:
I'm pre-launch and 1 year into the development of my application. My target is to be able to eventually handle millions of registered users with 5-10% of them concurrent. Up to this point I've used auto-increment to assign unique identifiers to rows. I am now considering switching to a non-sequential strategy. Oh, I'm using the LAMP configuration.

My reasons for avoiding auto-increment:
1. Complicates replication when scaling horizontally. Risk of collision is significant (when running multiple masters). Note: I've read the other entries in this forum that relate to ID generation and there have been some great suggestions -- including a strategy that uses auto-increment in a way that avoids this pitfall... That said, I'm still nervous about it.
2. Potential bottleneck when retrieving/assigning IDs -- IDs assigned at the database.

My reasons for being nervous about non-sequential IDs:
1. To guarantee uniqueness, the IDs are going to be much larger -- potentially affecting performance significantly

My New Strategy:
(I haven't started to implement this... I'm waiting for someone smarter than me to steer me in the right direction)
1. Generate a guaranteed-unique ID by concatenating the user id (1-9 digits) and the UNIX timestamp(10 digits).
2. Convert the resulting 11-19 digit number to base_36. The resulting string will be alphanumeric and 6-10 characters long. This is, of course, much shorter (at least with regard to characters) then the standard GUID hash.
3. Pass the new identifier to a column in the database that is type CHAR() set to binary.

My Questions:
1. Is this a valid strategy? Is my logic sound or flawed? Should I go back to being a graphic designer?
2. What is the potential hit to performance?
3. Is a 11-19 digit number (base 10) actually any larger (in terms of bytes) than its base-36 equivalent?

I appreciate your insights... and High Scalability for supplying this resource!




Reader Comments (10)

Is that unix user ID? In most cases apps always run under the same user so that part of the value would always be the same. The time stamp isn't reassuringly unique, especially on the same box when many different threads are getting time stamps at the "same time." So I'm not sure it's a good plan.

December 31, 1999 | Unregistered CommenterTodd Hoff

Here's an option: http://simonwillison.net/2009/May/25/uuid/ - serves up incrementing integers over a socket

December 31, 1999 | Unregistered CommenterTodd Hoff

Thanks for your responses, Todd :D

"Is that unix user ID?"
No, it's an auto-incrementing numeric ID given to each unique user when they register and confirm their account (i.e. "UserA = 222", "UserB = 333", "UserC = 444" **I chose these numbers for simplicity, they are not meant to convey any sequenced pattern)

Example:
UserB makes a comment. The unique ID to represent that comment would be created in this way...
1. Retrieve the user's ID ("222").
2. Retrieve the current timestamp ("1234567890").
3. Concatenate the two values ("2221234567890").
(To your point, it wouldn't matter if 100 other users were given the same timestamp, because of the concatenation, the resulting number would be absolutely unique. Each user would have approximately 10 billion unique IDs completely to themselves. The only possible conflict would be if UserB could make more than one comment per second (which is forbidden by design).)

continuing...
4. Convert to base 36 (i.e. "ahen9s7").
5. Insert into a binary CHAR() column.

I appreciate your thoughts on the first part of the plan. Assuming I could guarantee uniqueness, do you have any thoughts with regards to steps 4 or 5?

Thanks!

December 31, 1999 | Unregistered Commenterzorumbus

Obviously you have considered the performance penalty of having a non integer key. Some people say who cares some people say it matters a lot.

I assume the base36 stuff is because it will be part of a URL?

IMHO not even nanosecond timestamps are good enough if uniqueness is absolutely required. Comments will be coming fast, threads block on a system call, and it's easy for multiple threads to get the same value. That's what I've experienced under load anyway. For comments it may not matter.

December 31, 1999 | Unregistered CommenterTodd Hoff

"I assume the base36 stuff is because it will be part of a URL?"

Actually, no. They will appear in the URL long enough for the page to grab the value and throw it into a temporary "SESSION" variable... then the URL will be stripped of it's values (i.e. "subject.php?b=13u89shg" will become "subject.php").

My assumption (though probably flawed) was that a 7-digit hex value would be "faster" than it's 14-digit INT counterpart. Of course, if it is being stored as a "binary", then both the hex value and the INT would be converted to "1"s and "0"s... (at least, that's what I'm assuming)... and, in that case, they are the exact same thing.

I just don't know the performance difference between a shorter hex versus a longer INT...

For queries, is it better to have "33210921283762" or "eh7s3n0"? (or, are they both really just "110101100001010010010111001010010010010010001"?)

Thanks!

December 31, 1999 | Unregistered Commenterzorumbus

Does anyone know good reliable algorithm to generate numeric IDs in the cluster that would not overlap across multiple DBs?

The best thing I know is to create "NextId" table and keep stuff there, processes then request IDs one-by-one or in batches of X. It works like a charm, until one need to have Y more DB servers, now there is a tough choice - keep ID generation global, one DB host servers whole cluster or do something special.
The best "special" thing that comes to my mind is dedicating say two highest (lowest) bits to DB host ID, so ID generation now happens in non-overlapping bands on multiple DB hosts simultaneously.
Example
host A: from 0100 0000 to 0111 1111
host B: from 1000 0000 to 1011 1111
host C: from 1100 0000 to 1111 1111

Any better ideas?

December 31, 1999 | Unregistered Commenterikatkov

We actually use base 34 for a ton of stuff; we fold all zeros and ones to Os and Is. Our input/outputs (search, etc) do the same thing whenever we get data from an end-user (we use those for confirmation codes). This loses a fairly small amount of code bandwidth, but gets us almost completely user-enterable data.

December 31, 1999 | Unregistered CommenterRichard

I see the following performance penalties in this strategy.

1. Calculation of the Id.
2. Subsequent searches in the database would be string matching compared to integer matching. Integer matching is obviously way quicker than string matching.

In our system, we use Base64 encoded Ids for storing user queries. This is purely for the ease of use when the user queries us back. It didn't make sense to use the strategy for the entire schema.
We used timestamp concat (search.hashCode%100) and I can say that we get about 0.25% clash in first attempt generation. We insert about 4000 records a day in that table. Obviously, it isn't working and needs improvement.

December 31, 1999 | Unregistered CommenterRam

http://highscalability.com/id-generation-schemes

Check out my post on there, very similar to what you're thinking of. I've using this in production and it works a treat.

December 31, 1999 | Unregistered CommenterBrian Drought

> http://highscalability.com/id-generation-schemes
indeed it's very similar. How do you add/removed new shards? What happens if there is more shards that these special few bits you allocated allow you to reference. What happens if one shard's counter flipovers?

December 31, 1999 | Unregistered Commenterikatkov

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>