Generating unique IDs in distributed system

Recently, I’ve been working on developing distributed application based on microservices. One problem I’ve ran into with this project is the way to generate unique IDs for entities within application. Although it seems like a trivial problem, there are some requirements that made this a bit tougher to crack.

  • system should work with multiple databases, so we can not use DB auto increment support, since it should be possible to import records from various DBs into one
  • IDs should be human readable (alphanumeric characters) and should have a length of 8-12 characters (fixed length)
  • IDs should be sequential, meaning that they can be ordered from least to most recent
  • should be unique within the system (don’t care about being globally unique)

Available options

The first options that comes to mind is to use UUIDS, since they are virtually guaranteed to be unique. Java (system is written in Java) provides native support for UUIDs and they can be created easily. Unfortunately, UUIDs are 128 bits in length, which gives much longer IDs then we need. In addition, they are not sequential. We could use Type 1 (sequential)  UUIDs for this, they are still too long for the requirements.

Another option I thought about is using timestamps. The problem with them is that they are also too long, plus the possibility of collision is high. Since clocks in various system nodes may not be synchronized, that gives us additional problems to consider. Also, timestamps are measured from epoch time, and will reach it’s upper limit in near future (Year 2038 problem). This probably not important at this point in time, but I would like to have the system be able to run indefinitely.

One more option that comes to mind is to use centralized authority that would issue IDs of correct length and guarantee that there is no collision. From aspect of guaranteed uniqueness, this is the best option, but it adds additional overhead of maintaining this central authority (like database or something). Additionally, it introduces another dependency for components of the system, and they should be as loosely coupled as possible.Finally, we need a way to distinguish between IDs generated in s

So, after much deliberation, we decided to use kind of hybrid combination, consisting of timestamp and some random number with relatively low possibility of collision. This would give us sequential property, and also allow us to keep the length of identifiers in desired range.

Using partial timestamps

The way we decided to go is using partial timestamps, ie. number of seconds since the beginning of the current year. Since the year has maximum of 366 * 24 * 3600 seconds (in leap year), that gives us over 30 million seconds. Turns out that this number of seconds can be representing by a 25 bits. It is important to note that all timestamp must be calculated in the same timezone, because our system can be geographically dispersed. So, we will use UTC time zone to calculate all timestamps.

The caveat here is that, if your system runs for more then a year, your timestamps will roll over and start from 0 again. Since we want IDs  to be sequential, we need to make sure that the ones generated in later year appear after previous ones. In order to do that, we prefix each ID with a ordinal number of year it was generated in, starting from 0. This way we can keep IDs in order based on time they are created. Using 8 bits for this part will give us 256 years in which system can run, which is more then enough for any current software.

But, there is high possibility that a lot of IDs are generated per second, so we need a way to distinguish them. We do this by generating random number which we will append to timestamp. Here, we must consider a trade off: large random number have lower possibility of collision, while they add extra length to our IDs. So, we must be carefully consider the requirements of application here. How many IDs will the system have to generate each second? Will all IDs be used for same entities? How do we handle collision within the system?

Representation of identifiers

As specified in requirements, IDs must be human readable and of fixed length, in range of 8-12 alphanumeric characters. This means that we need to encode generated bits into some human-friendly format. Obvious choice for this would be HEX numbers, and it is what I would prefer.  But, using hexadecimal characters does pose some limit regarding the length.

For example, in order to encode 25 bits as hex number, we would need 7 characters. If we add, for example, another 8 bits to represent current year, that is another 2 characters, making the total of 9. That leaves us with 3 extra characters to use for random part of ID. This means 12-bit number, which is about 4K possible number, which means possibility of collision is high.

We could use Base-64 encoding to reduce the scope, but it contains some non-alphanumeric characters. So, as compromise, Base-62 encoding looks like best approach. In this case, it will take 5 characters for timestamp part and 2 for year, since we want all our IDs to have the same length. This leaves us with 5 characters for random part, which gives us up to 30 bits, but we can make it 28 just to be safe, since with 30 bits Base-62 could generate one extra character for some numbers. With 28 bits, we have over 268 million possible combinations!

So, in summary, identifiers would have the following format:

  • 8 bits for year (2 characters in Base62 encoding)
  • 25 bits for timestamp (5 characters in Base62 encoding)
  • 28 bits for random number (5 characters in Base62 encoding)

This gives us 12 character long identifiers which should be sufficiently unique for most purposes.

Probability of collision

Theoretically, this system can generate over 200 million unique IDs per second. Realistically, it will have about 1000-2000 collisions for 1 million IDs per second, using Java Random. This represents collision rate of 0.1 -0.2 percent. For smaller number of IDs per second, collision rate decreases significantly. For example, 10,000 IDs per second will produce almost no collision.

This is all based on running sample code in single JVM on single machine. Running on multiple machines may probably give different results. For inquisitive types, there is mathematical explanation on how to calculate collision probability here.

Generally, I would say it would be possible to generate 5-10K unique IDs per second using this system. This should be more then enough for most applications, unless you have really massive scale to take care of. In that case, you can tweak the system to use larger random numbers, thus reducing the possibility of collision. Of course, this will make IDs longer, so it’s all about trade off. If you absolutely need to have guaranteed uniqueness, then using some other approach (like UUIDs) is probably better idea.

Comments

This is one approach we are using to generate unique IDs, and I would really like to hear comments from people who have experience with this kind of things. For now, this approach works fine and I am not aware of any pitfalls we could run into while using it. Of course, we might be overlooking something and are open to comments and suggestions. If you can spot problems with this approach, please speak up!

Sample source code for this can be found on Github.

 

Leave a Reply

Your email address will not be published. Required fields are marked *