unique id by ulid vs. uuid

1,913 阅读6分钟

www.red-gate.com/simple-talk…

What is the best way to handle big number returning from database in nodeJS?

stackoverflow.com/questions/5…

www.cryptosys.net/pki/uuid-rf…

Storing UUID Values in MySQL

www.percona.com/blog/2014/1…

What is UUID?

Generate a UUID compliant with RFC 4122

A Universally Unique IDentifier (UUID) - also called a Global Unique IDentifier (GUID) - is a 128-bit value formatted into blocks of hexadecimal digits separated by a hyphen ("-", U+002D). A typical UUID is :

AA97B177-9383-4934-8543-0F91A7A02836. 

It doesn't matter whether the letters A-F are upper or lower case.

A version 4 UUID is defined in RFC 4122: 128 randomly-generated bits with six bits at certain positions set to particular values. For example,

AA97B177-9383-4934-8543-0F91A7A02836 ^ ^ 1 2

The digit at position 1 above is always "4" and the digit at position 2 is always one of "8", "9", "A" or "B".

How to generate a version 4 UUID ?

The procedure to generate a version 4 UUID is as follows:

  1. Generate 16 random bytes (=128 bits)
  2. Adjust certain bits according to RFC 4122 section 4.4 as follows:
  • set the four most significant bits of the 7th byte to 0100'B, so the high nibble is "4"
  • set the two most significant bits of the 9th byte to 10'B, so the high nibble will be one of "8", "9", "A", or "B" (see Note 1).
  1. Encode the adjusted bytes as 32 hexadecimal digits
  2. Add four hyphen "-" characters to obtain blocks of 8, 4, 4, 4 and 12 hex digits

Output the resulting 36-character string "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"

source code of C#

也没有那么困难,是这样的。对不?

// $Id: Uuid.cs $
// $Date: 2012-09-09 07:47Z $
// $Revision: 1.0 $
// $Author: dai $
// This module uses functions from the CryptoSys (tm) PKI Toolkit available from
// <www.cryptosys.net/pki/>.
// Include a reference to `diCrSysPKINet.dll` in your project.
// REFERENCE:
// RFC 4122 "A Universally Unique IDentifier (UUID) URN Namespace", P. Leach et al,
// July 2005, <http://www.ietf.org/rfc/rfc4122.txt>.
using System;
using CryptoSysPKI;
static class Uuid
{
    public static string UUID_Make()
    {
        //                                           12345678 9012 3456 7890 123456789012
        // Returns a 36-character string in the form XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        // where "X" is an "upper-case" hexadecimal digit [0-9A-F].
        // Use the LCase function if you want lower-case letters.
        byte[] abData = null;
        string strHex = null;
        // 1. Generate 16 random bytes = 128 bits
        abData = Rng.Bytes(16);
        // DEBUGGING...
        //'Console.WriteLine("RNG=" & Cnv.ToHex(abData))
        // 2. Adjust certain bits according to RFC 4122 section 4.4.
        // This just means do the following
        // (a) set the high nibble of the 7th byte equal to 4 and
        // (b) set the two most significant bits of the 9th byte to 10'B,
        //     so the high nibble will be one of {8,9,A,B}.
        abData[6] = (byte)(0x40 | ((int)abData[6] & 0xf));
        abData[8] = (byte)(0x80 | ((int)abData[8] & 0x3f));
        // 3. Convert the adjusted bytes to hex values
        strHex = Cnv.ToHex(abData);
        // DEBUGGING...
        //'Console.WriteLine("ADJ=" & Cnv.ToHex(abData))
        //'Console.WriteLine("                ^   ^") ' point to the nibbles we've changed
        // 4. Add four hyphen '-' characters
        //'strHex = Left$(strHex, 8) & "-" & Mid$(strHex, 9, 4) & "-" & Mid$(strHex, 13, 4) _
        //'    & "-" & Mid$(strHex, 17, 4) & "-" & Right$(strHex, 12)
        strHex = strHex.Substring(0, 8) + "-" + strHex.Substring(8, 4) + "-" + strHex.Substring(12, 4) + "-" + strHex.Substring(16, 4) + "-" + strHex.Substring(20, 12);
        // Return the UUID string
        return strHex;
    }
    public static void Main()
    {
        string strUuid = null;
        int i = 0;
        for (i = 1; i <= 10; i++) {
            strUuid = UUID_Make();
            Console.WriteLine("{0}", strUuid);
        }
    }
}

Which UUID version to use?

stackoverflow.com/questions/2…

There are two different ways of generating a UUID.

If you just need a unique ID, you want a version 1 or version 4.

Version 1: This generates a unique ID based on a network card MAC address and a timer. These IDs are easy to predict (given one, I might be able to guess another one) and can be traced back to your network card. It's not recommended to create these.

Version 4: These are generated from random (or pseudo-random) numbers. If you just need to generate a UUID, this is probably what you want.

If you need to always generate the same UUID from a given name, you want a version 3 or version 5.

Version 3: This generates a unique ID from an MD5 hash of a namespace and name. If you need backwards compatibility (with another system that generates UUIDs from names), use this.

Version 5: This generates a unique ID from an SHA-1 hash of a namespace and name. This is the preferred version.

优化 optimal for UUID version 1

A few years ago Peter Zaitsev, in a post titled “To UUID or not to UUID,” wrote: “There is timestamp based part in UUID which has similar properties to auto_increment and which could be used to have values generated at the same point in time physically local in BTREE index.”

For this post, I’ve rearranged the timestamp part of UUID (Universal Unique Identifier) and did some benchmarks.

Many people store UUID as char (36) and use as row identity value (PRIMARY KEY) because it is unique across every table, every database and every server and allow easy merging of records from different databases. But here comes the problem, using it as PRIMARY KEY causes the problems described below.

Problems with UUID

UUID has 36 characters which make it bulky.InnoDB stores data in the PRIMARY KEY order and all the secondary keys also contain PRIMARY KEY. So having UUID as PRIMARY KEY makes the index bigger which cannot be fit into the memory

Inserts are random and the data is scattered.Despite the problems with UUID, people still prefer it because it is UNIQUE across every table, can be generated anywhere. In this blog, I will explain how to store UUID in an efficient way by re-arranging timestamp part of UUID.

Structure of UUID

MySQL uses UUID version 1 which is a 128-bit number represented by a utf8 string of five hexadecimal numbers.

The 1-3 numbers are generated from a timestamp.

The 4 number preserves temporal uniqueness in case the timestamp value loses monotonicity (for example, due to daylight saving time).

The 5 number is an IEEE 802 node number that provides spatial uniqueness. A random number is substituted if the latter is not available (for example, because the host computer has no Ethernet card, or we do not know how to find the hardware address of an interface on your operating system). In this case, spatial uniqueness cannot be guaranteed. Nevertheless, a collision should have a very low probability.

The timestamp is mapped as follows:

When the timestamp has the (60 bit) hexadecimal value:

1d8 eebc 58e0a7d7. 

The following parts of the UUID are set::

58e0a7d7–eebc–11d8-9669-0800200c9a66. 

Note:

  • The 1 before the most significant digits (in 11d8) of the timestamp indicates the UUID version, for time-based UUIDs this is 1.
  • 1st three numbers are based on timestamp, so they will be monotonically increasing.
  • 4th and 5th parts would be mostly constant if it is generated from a single server.

Let’s rearrange the total sequence making the UUID closer to sequential. This makes the inserts and recent data lookup faster.

  1. Dashes (‘-‘) make no sense, so let’s remove them. and
  2. rearrange the total sequence

so result is

58e0a7d7-eebc-11d8-9669-0800200c9a66 => 
11d8eebc58e0a7d796690800200c9a66

Benchmarking for insert time when db size increase

I created three tables

  • events_uuid – UUID binary(16) PRIMARY KEY
  • events_int – Additional BIGINT auto increment column and made it as primary key and index on UUID column
  • events_uuid_ordered – Rearranged UUID binary(16) as PRIMARY KEY

I created three stored procedures which insert 25K random rows at a time into the respective tables. There are three more stored procedures which call the random insert-stored procedures in a loop and also calculate the time taken to insert 25K rows and data and index size after each loop. Totally I have inserted 25M records.

The data size for UUID table is more than the other two tables. Index Size Horizontal axis – Number of inserts x 25,000 Vertical axis – Index Size in MB

For the table with UUID as PRIMARY KEY, you can notice that as the table grows big, the time taken to insert rows is increasing almost linearly. Whereas for other tables, the time taken is almost constant.

The size of the UUID table is almost 50% bigger than Ordered UUID table and 30% bigger than the table with BIGINT as PRIMARY KEY. Comparing the Ordered UUID table BIGINT table, the time is taken to insert rows and the size are almost the same. But they may vary slightly based on the index structure.

Introduce ULID

UUID can be suboptimal for many uses-cases because:

It isn't the most character efficient way of encoding 128 bits of randomness

UUID v1/v2 is impractical in many environments, as it requires access to a unique, stable MAC address
UUID v3/v5 requires a unique seed and produces randomly distributed IDs, which can cause fragmentation in many data structures
UUID v4 provides no other information than randomness which can cause fragmentation in many data structures
Instead, herein is proposed ULID:

128-bit compatibility with UUID

1.21e+24 unique ULIDs per millisecond
Lexicographically sortable!
Canonically encoded as a 26 character string, as opposed to the 36 character UUID
Uses Crockford's base32 for better efficiency and readability (5 bits per character)
Case insensitive
No special characters (URL safe)
Monotonic sort order (correctly detects and handles the same millisecond)

save to db

Since ulids are sortable, is it ok to store them as char(26) in mysql/pgsql... when used as primary keys or should they rather be stored in some other form(binary/bytes...)?

See GUIDs as fast primary keys under multiple databases So, string format is prefferable for MySQL and for PostgreSQL