TESID is made up of three parts:
2. Encryption,
3. String encoding, and
1. A convention for reducing occupation of the ID space
   and attaching a discriminant for types or other such details.

(I’ve numbered these items in their encoding order,
but I’m going to discuss them in this altered order.
Things would get boring if list item numbers always had to be in order.)

See <algorithms.txt> for a basic overview of it all.
This file explains why things are the way they are.

2. 5peck: Speck, but with five-based word sizes and just one key
================================================================

Encryption is done with some new instances of the Speck cipher family,
which I have called 5peck (because it amused me).
5peck is pronounced “fpeck”, not to be confused with “ſpeck”.
(This also amuses me.)
In places where an identifier is required that cannot start with a numeral, I use fpeck.

There are two particular properties of 5peck ciphers:

• All instances share an expanded key,
taking as many bits as they require of each round key.
This is a significant deviation from Speck for small word sizes,
since there’s not much point in longer keys in usual lightweight block cipher situations,
due to other weaknesses like birthday attacks.
But such reasons are inapplicable to 5peck.

• Word sizes are multiples of five,
in order to match the base-32 alphabet that follows (32 is five bits).
This is not normal because it’s more computationally expensive
with no real benefit for normal cryptographic purposes,
but there’s no reason not to do it when you have a reason like I do.

5peck ciphers don’t get names of their own (unlike Speck32/64, Speck128/128, &c.).
Wouldn’t want them getting self-important or anything.

In the reference implementation, all 5peck ciphers use a 64-bit machine word,
with appropriate rotation and wrapping addition and subtraction via masking,
but all that is implementation detail.

For key scheduling: a 128-bit key is expanded with n=64, m=2, T=30, α=9, β=2.
Encryption takes the n least significant bits of each round key.

5peck is defined for word sizes n={10, 15, 20, 25, 30, 35, 40, 45, 50}.
In all cases, T=30, α=9, β=2, and m is not applicable.

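To make that concrete, here is a rough Rust sketch of the scheme as described above: one Speck-style key expansion with n=64, m=2, T=30, α=9, β=2, and encryption with rotations and masked wrapping arithmetic in a u64 carrier. The function names and key word order are mine, not the reference implementation’s, and nothing here is verified against it beyond round-tripping.

```rust
// Sketch only: not the reference implementation, and unchecked against it.

const T: usize = 30; // rounds
const ALPHA: u32 = 9;
const BETA: u32 = 2;

/// Expand a 128-bit key once, Speck-style, with n=64, m=2.
fn expand_key(key: u128) -> [u64; T] {
    let mut k = key as u64; // low word (assumed word order)
    let mut l = (key >> 64) as u64; // high word
    let mut round_keys = [0u64; T];
    for i in 0..T {
        round_keys[i] = k;
        l = k.wrapping_add(l.rotate_right(ALPHA)) ^ i as u64;
        k = k.rotate_left(BETA) ^ l;
    }
    round_keys
}

/// Rotate right within an n-bit word carried in a u64.
fn ror(v: u64, r: u32, n: u32) -> u64 {
    let mask = (1u64 << n) - 1;
    ((v >> r) | (v << (n - r))) & mask
}

/// Rotate left within an n-bit word carried in a u64.
fn rol(v: u64, r: u32, n: u32) -> u64 {
    ror(v, n - r, n)
}

/// Encrypt the 2n-bit block (x, y), taking the n least significant
/// bits of each round key.
fn encrypt(round_keys: &[u64; T], n: u32, mut x: u64, mut y: u64) -> (u64, u64) {
    let mask = (1u64 << n) - 1;
    for &rk in round_keys {
        x = (ror(x, ALPHA, n).wrapping_add(y) & mask) ^ (rk & mask);
        y = rol(y, BETA, n) ^ x;
    }
    (x, y)
}

/// Invert `encrypt`.
fn decrypt(round_keys: &[u64; T], n: u32, mut x: u64, mut y: u64) -> (u64, u64) {
    let mask = (1u64 << n) - 1;
    for &rk in round_keys.iter().rev() {
        y = ror(y ^ x, BETA, n);
        x = rol((x ^ (rk & mask)).wrapping_sub(y) & mask, ALPHA, n);
    }
    (x, y)
}

fn main() {
    // One expanded key serves every word size.
    let rks = expand_key(0x0011223344556677_8899aabbccddeeff);
    for n in [10, 15, 20, 25, 30, 35, 40, 45, 50] {
        let x = 123 & ((1 << n) - 1);
        let (cx, cy) = encrypt(&rks, n, x, 456);
        assert_eq!(decrypt(&rks, n, cx, cy), (x, 456));
    }
    println!("round trips ok");
}
```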
The notable omission from the sequence is n=5,
which would produce 1024 two-character TESIDs.
I omitted it because it’s not *that* useful,
I’d need to support a different value of at least α,
and I was concerned about its integrity,
given how extremely it deviates from the pattern I see in the Speck paper.
(To reach 128-bit security, I think it’d need at least a few more rounds,
T=30 yielding only 150 bits in its expanded key.)
So instead I start at n=10 and four-character TESIDs.

Justification for the single expanded key and T=30
--------------------------------------------------

Probably the most significant deviation from Speck is the 128-bit key used all round:
does it or can it lead to 128-bit security, even at n=10?
And how many rounds do we need?

These are the numbers of rounds for the defined Speck instances:

• For n=16: T = 18 + m
• For n=24: T = 19 + m
• For n=32: T = 23 + m
• For n=48: T = 26 + m
• For n=64: T = 30 + m

I’m going to generalise and call this T = w + m,
where w is very approximately proportional to n + c for some constant c.
n=24 is a mild outlier from my predictions: I’d have expected w=20, not w=19.
And assuming that w is *linearly* correlated to n is hubris.

From the addition (rather than multiplication) of m,
it is clear that security is not particularly strongly proportional to the number of round key bits:
Speck128/128 has 16 bits of round key per bit of security,
but Speck128/256 has only 8.5 bits of round key per bit of security,
and Speck32/64 has 5.5 bits of round key per bit of security.
You clearly can’t take this too far, as there’s an obvious hard lower bound of 1,
and imperfections in diffusion will mean that the real lower bound is higher.
(Skipping to the end state:
for n=10 I end up with 2.34 bits of round key per claimed bit of security,
which is not a huge amount of margin, but still sufficient, I think.
n=5 would have been 1.17, which I suspect would be insufficient.)

(I’m treating the number of bits of security as equivalent to the number of bits of key,
because that is the intent of the cipher.)

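Those ratios are just the total round key bits (T × n) divided by the key size; a quick arithmetic check, with the Speck round counts taken from the Speck paper:

```rust
/// Bits of round key per bit of security: T rounds of n-bit round keys,
/// divided by the key size in bits.
fn ratio(t: u32, n: u32, key_bits: u32) -> f64 {
    (t * n) as f64 / key_bits as f64
}

fn main() {
    assert_eq!(ratio(32, 64, 128), 16.0); // Speck128/128: T=32
    assert_eq!(ratio(34, 64, 256), 8.5);  // Speck128/256: T=34
    assert_eq!(ratio(22, 16, 64), 5.5);   // Speck32/64: T=22
    println!("5peck n=10: {:.2}", ratio(30, 10, 128)); // 2.34
    println!("5peck n=5 would be: {:.2}", ratio(30, 5, 128)); // 1.17
}
```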
Well then, the number of rounds.
Or actually, the key expansion technique, that’s important too.

Originally, I expanded the key for each value of n,
by consuming one word of the key until it was all gone
(plus however many extra zero bits were needed at the end);
this led to m=ceil(128 / n).
I also tentatively chose T=30,
leading to values of w from 17 (for n=10) to 27 (for n=50).
This seemed fine for a while,
but I wasn’t enthusiastic about having to store 2160 bytes of expanded keys
(as I was using u64 for all n), and n=10’s m=13 was rather a lot.
Then I thought of just expanding it once, with n=64, m=2,
and during encryption taking the n bits of each round key.
The purpose of key expansion is to diffuse the key bits into the round keys,
so this shouldn’t reduce security in any way.
And it gets the whole key into the mix sooner,
so it should lead to *increased* security in the round keys for small values of n.
This effectively reduces the very high values of m for small values of n,
so that my w is higher for them.
This suggests that I could probably reduce the number of rounds for smaller n.
A nice simple solution here is T = 20 + n/5,
which stays higher than Speck, but by much less.
The most important one to consider here is n=10: this takes it down to T=22,
meaning 220 bits of round key.
This is heading too low for my liking.
With this approach to key scheduling,
I suspect you may need more rounds to fit this much security in.

(This is why I eliminated n=5: it’s hard to get such high security
out of such a small block size with this type of cipher; even n=10 may be pushing it.)

And really, performance is sufficiently excellent that it’s not worth fussing about reducing rounds by 25%.
My Rust benchmark of encrypting all n=10 values (2²⁰ of them, around a million)
shows it taking under half a nanosecond for each one (under 0.5ms for all) on my laptop,
and this will hold for all block sizes, since all are using u64 machine words with masking.
So I’ll just leave it a flat 30 rounds,
which I’m fairly confident is sufficient for 128-bit security at n=10,
and significant overkill for n=15 to n=40 or so.
I’ve run the numbers in a few different ways
and am as confident in the results as I could hope to be as a crypto layman.

Justification for α = 9, β = 2
------------------------------

Speck’s decision for their rotation values is explained in
<https://eprint.iacr.org/2017/560.pdf#page=8>:
they started with (7, 2) and it worked well,
but switched to (8, 3), which was still good enough,
for better x86 performance due to SIMD byte-shuffle,
except for Speck32 where (8, 3) was too low-quality on a 16-bit word
and they didn’t want to increase the number of rounds to compensate.
They acknowledge that there are other values like (9, 2) that are better in a vacuum,
but that the attacks they handle better aren’t security-limiting attacks,
and that performance is worse on some platforms.

In 5peck, word sizes are multiples of 5,
so the SIMD byte-shuffle argument fails,
as do the other performance arguments.
Intuition suggested to me that α − β = 5 might suffer slightly;
some basic cryptanalysis concurred and suggested (9, 2) works well.
(I use that for the n=64, m=2 key expansion too.)
In related news, it *looks* to me from some basic CryptoSMT analysis
like multiples of five for n fare better than multiples of eight.
Not sure if I’m barking up the wrong tree or if that’s a real effect.
If you happen to be a cryptographer, I’d quite like to know more, for curiosity’s sake.

I do not know if my key reuse across ciphers weakens anything.
In principle, I don’t believe it should,
especially given that a particular value will only ever be encrypted with one value of n.

Remember that even if I can figure out how to nudge cryptographers’ tools along,
I am not a professional cryptographer and really don’t know what I’m doing.
But I do think that the cryptography covered here is sound and robust,
to approximately 128 bits of security,
which is frankly pretty extreme overkill for ID scrambling.
*Even if* I have gravely miscalculated on n=10,
I don’t think there’s any way that it could reduce it below even 80 bits,
which would still be significant overkill for the applicable threat model.

I’ve used quite a bit of logic and reasoning here,
which I imagine is a bad sign in cryptographic circles,
where it’s better to use concrete measurements.
I invite analysis by experts whose fancy is tickled.

If I’ve scared you in all of this, I would like to reassure you:
I’ve been handling the cryptography carefully just out of principle,
and in the real world it doesn’t actually matter,
because in the intended use case of this library,
literally no currently-known or hypothesised attack will *ever* be viable.

3. String encoding
==================

I have chosen this alphabet:

23456789abcdefghijkmnpqrstuvwxyz

That is, ASCII digits and lowercase letters, minus the following characters:

• 0, as visually confusable with O and aurally confusable with o
(since zero is extremely commonly pronounced “oh”, just like o);

• 1, as visually confusable with l and I;

• l, as visually confusable with 1 and I; and

• o, as aurally confusable with 0.

For some purposes it might be nice to remove more:

• i, as confusable in uppercase with l and 1;

• 5 and s, as mutually confusable in handwriting and uppercase;

• 2 and z, as mutually confusable in handwriting and uppercase; and

• v and u, as mutually confusable in some fonts and handwriting.

But with bit-level block sizes
(and without messing around with more esoteric format-preserving encryption, which I did try),
a power-of-two base yields the best results for performance, code size and compactness of representation,
and I didn’t want to go down to 16.
So base-32 it is, meaning five bits per character.

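A minimal sketch of such an encoding, five bits per character with this alphabet (hypothetical helper names, not the libraries’ actual API):

```rust
/// The chosen alphabet: ASCII digits and lowercase letters minus 0, 1, l, o.
const ALPHABET: &[u8; 32] = b"23456789abcdefghijkmnpqrstuvwxyz";

/// Encode `value` as `len` characters, five bits each, most significant first.
fn encode(value: u64, len: u32) -> String {
    (0..len)
        .rev()
        .map(|i| ALPHABET[(value >> (5 * i)) as usize & 31] as char)
        .collect()
}

/// Decode, rejecting any character outside the alphabet.
fn decode(s: &str) -> Option<u64> {
    s.bytes().try_fold(0u64, |acc, b| {
        let v = ALPHABET.iter().position(|&a| a == b)?;
        Some(acc << 5 | v as u64)
    })
}

fn main() {
    assert_eq!(encode(0, 4), "2222"); // zero bits map to the first character
    assert_eq!(decode(&encode(123456, 4)), Some(123456));
    assert_eq!(decode("10"), None); // 1 and 0 are not in the alphabet
    println!("ok");
}
```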
An interesting consequence of the technical choices is that TESIDs are always an even number of characters long.
This means that you could even provide an *optional* check digit, if it took your fancy:
even-length string means no check digit, odd-length means check digit.
If you want a check digit, I suggest the Damm algorithm with an order-32 table.
At this time, I offer no assistance in this matter.

When thinking of ID lengths: every five bits is one character,
or more practically (because of the block sizes used) every ten bits two characters.
(Speck can only support even block sizes,
so I couldn’t support odd-length strings without compromising my design a bit,
or finding another cipher as good that could support odd block sizes,
which I failed to do, though I did toy with Hasty Pudding for a while.)

One final obvious question: why base-32, why not a different base?

• To begin with, only ASCII is considered. Non-ASCII would be troublesome.

• If convenient, a power of two is good, for its bit alignment,
leading to a consistent pattern of block sizes and lengths,
easier reasoning, and slightly improved efficiency of execution and length.

• Base-64 is eliminated because it’s not sufficiently human-friendly:

(a) it requires two non-alphanumeric characters in its alphabet,
which messes up wordwise selection since, of ASCII,
underscore is the only non-alphanumeric character that counts as “word”.

(b) confusable characters are a problem,
and mixed case an inconvenience.

In its favour, exactly six bits per character *is* nice,
and its evenness means you could fully efficiently support all string lengths,
odd as well as even (with Speck n=3×length).

• Base-62 (alphanumeric) solves the wordwise selection problem,
but doesn’t solve confusables or mixed case,
and loses the neatness of bit alignment.

• In early TESID design I used base-58 (alphanumeric minus 0, I, O, l),
and was considering going down to base-57 (removing 1),
which still yielded 6-/11-character strings for the 32- and 64-bit block sizes I used at the time.
But a couple of people convinced me that case-sensitivity was undesirable,
and once I got into custom block sizes,
the difference in base doesn’t make much difference to length.

• Base-32 is pretty much as large as you can go while removing all the usual confusables.

• Going down perhaps as far as base-25 to eliminate more diverse confusables
(mentioned above) was tempting, but I like the neatness of 32.

• Going down to base-16 (whether hexadecimal or something else)
*would* solve all the problems identified,
including even minimising concerns about undesirable patterns in strings,
but I’m not fond of the longer IDs it leads to:
a 40-bit value is 10 characters,
compared to base-32’s 8.
(And with a little sparsity or type discrimination,
a 40-bit value isn’t unreasonable.)
Also there’s something I can’t quite put my finger on where hexadecimal just feels… *programmery*.
The base is too small to feel right in some way I can’t properly express.
A larger base somehow ends up feeling more human.
(Don’t take from this that programmers are not human.
Unless it amuses you to do so, in which case I won’t try to stop you.)

• Decimal is unsuitable as it tempts numeric treatment,
but leading zeroes are significant due to the cipher range overlaps.
Also it’s just misleading for people, because if they see numbers,
they kinda expect a sequence.

If designing with a different alphabet, different block sizes would make sense.
If encoding to base-*b* and padding to length *l*,
the largest block size that can be used is ceil(log₂(b**l - b**(l - 1)))
(this formula includes rezeroing the block range,
which is useful for some values of b and l).
For l={0, 1, 2, …}, up to the first block size of at least 64 bits:
b=16 yields block sizes 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64;
b=20 yields block sizes 0, 5, 9, 13, 18, 22, 26, 31, 35, 39, 44, 48, 52, 57, 61, 65;
b=25 yields block sizes 0, 5, 10, 14, 19, 24, 28, 33, 38, 42, 47, 52, 56, 61, 65;
b=32 yields block sizes 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65;
b=58 yields block sizes 0, 6, 12, 18, 24, 30, 36, 41, 47, 53, 59, 65;
b=64 yields block sizes 0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66.
But with Speck, block sizes must be multiples of two,
so any odd-numbered block sizes must be reduced by one.
(That is, a five-character string in base-32 could only go up to 2²⁴, not 2²⁵.)

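These lists can be reproduced mechanically from that formula; a sketch (my own helper, stopping at the first size of at least 64 bits, and before the reduce-odd-sizes-to-even Speck adjustment):

```rust
/// Block sizes for base `b`, for l = 0, 1, 2, …: ceil(log₂(bˡ − bˡ⁻¹)).
fn block_sizes(b: u128) -> Vec<u32> {
    let mut sizes = vec![0]; // l=0
    for l in 1.. {
        let range = b.pow(l) - b.pow(l - 1);
        // ceil(log₂(range)), via the bit length of range − 1.
        let bits = 128 - (range - 1).leading_zeros();
        sizes.push(bits);
        if bits >= 64 {
            break;
        }
    }
    sizes
}

fn main() {
    assert_eq!(
        block_sizes(32),
        [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
    );
    assert_eq!(block_sizes(58), [0, 6, 12, 18, 24, 30, 36, 41, 47, 53, 59, 65]);
    println!("ok");
}
```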
To my mind, the interesting variants of TESID here are:

• Supporting odd-length strings,
which for base-32 means Speck n={2, 5, 7, 10, 12, 15, 17, 20, 22, …}.
Presuming you still started at four characters (n=10),
I think the values α=9 and β=2 would still be sound.

• Going down to base-16, with Speck n={2, 4, 6, 8, 10, 12, …, 64},
officially supporting the full u128 range of up to 32 characters,
though probably still starting at five characters (n=10).
If you wanted the first 65536 IDs to be less than five characters,
I’d suggest officially dropping below 128-bit security.
(Aside: n=2 is fundamentally incompatible with 128-bit security,
as there are too few possible permutations of a block: 2⁴! < 2¹²⁸.)
The fact that base-16 aligns with bytes is quite nice.
In the Rust implementation, since encode_long would no longer be fallible,
I think I’d drop the u64/u128 divide and go u128 throughout.

Supporting odd-length strings does disclose more epochs,
but it doesn’t matter, especially not when you can tweak all that with⸺

1. Sparsity and discrimination
==============================

This is a simple extra step that can be added before encryption,
in order to make valid IDs harder to guess,
and to avoid overlap between IDs of different types.

id → id × sparsity + discriminant

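As a sketch (my own names, not the libraries’ API; u128 because these operations happen in up-to-100-bit space):

```rust
/// Apply sparsity and a discriminant before encryption.
fn spread(id: u128, sparsity: u128, discriminant: u128) -> u128 {
    // A discriminant at least as large as the sparsity reintroduces
    // ID reuse across types; see the discussion below.
    debug_assert!(discriminant < sparsity);
    id * sparsity + discriminant
}

/// Inverse, after decryption: recover (id, discriminant).
fn unspread(value: u128, sparsity: u128) -> (u128, u128) {
    (value / sparsity, value % sparsity)
}

fn main() {
    let v = spread(42, 256, 7);
    assert_eq!(v, 42 * 256 + 7);
    assert_eq!(unspread(v, 256), (42, 7));
    println!("ok");
}
```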
There are two purposes for sparsity:

1. Reducing the occupation of the ID space.
If you don’t do this, then by the time you have a million or so IDs assigned (all the ones up to 2²⁰),
*every* four-character string made up of base-32 characters will correspond to a valid ID.
This can be undesirable, for example if you would like to use IDs as authentication,
or if you just plain don’t want valid IDs to be guessable.

(“Using IDs as authentication”: that is,
having the ID is sufficient to access and/or mutate an entity.
This means things like magic login links,
and unlisted videos on a site like YouTube.)

Given a 64-bit ID and 64-bit discriminant, the largest value that can safely be used for sparsity is 2³⁶ − 1.
I suggest using powers of two (assigning a certain number of bits), but you don’t have to.
If you allot 36 bits of sparsity (the most that can safely be done with 64-bit IDs,
as 36 + 64 = 100, which is the largest block size supported by the 5peck ciphers,
though in practice nothing will fill even 50 bits of that 64-bit ID,
so with care you can go higher than 36 if you want),
then only one in every 68 billion possible IDs will be valid,
and it’s very probably reasonable to use IDs as authentication.
With 36 bits of sparseness, you’d get 16 8-character IDs,
then 16368 10-character IDs, then 16 million 12-character IDs, &c.

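That 2³⁶ − 1 limit is easy to verify with 128-bit arithmetic: with a full 64-bit ID and a full 64-bit discriminant, the result must still fit in the 100-bit block.

```rust
fn main() {
    let id_max = u64::MAX as u128; // largest 64-bit ID (and discriminant)
    let block = 1u128 << 100; // largest 5peck block: 2n = 100 bits
    let s = (1u128 << 36) - 1; // the stated maximum sparsity

    // id × sparsity + discriminant stays below 2¹⁰⁰ even at the extremes…
    assert!(id_max * s + id_max < block);
    // …but the next sparsity value up can overflow the block.
    assert!(id_max * (s + 1) + id_max >= block);
    println!("2^36 - 1 is the largest safe sparsity");
}
```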
2. Avoiding ID reuse across types or tenants, by adding a discriminant.
Without this, a given ID number will produce the same TESID regardless of type,
which could lead to type confusion,
allow discovery of valid IDs of different types
(or on different tenants, or whatever),
or disclose details about types’ sequences.
With this applied correctly and consistently,
TESIDs will effectively represent the encryption of the (type, id) tuple,
instead of just the ID number.
Because of this, a decoder can also determine which type an ID pertains to.

If you are going to use TESID for more than one type,
I recommend adding bits for a type discriminant.

As for how much sparsity to add, decide that yourself.
Every bit you add increases the amortised ID length by 0.2 characters.
(Or, every ten bits you add increases the ID length by exactly 2 characters.)

Here’s a reference table for some possible values,
choosing powers of two (though you’re not limited to it):

┌─────────────┬────────┬───────┬────────────┬─────────────┬─────────────┬────────────────┬───┐
│ Sparsity    │ (Bits) │ Extra │ № 4-char   │ № 6-char    │ № 8-char    │ № 10-char      │ … │
│             │        │ chars │ IDs        │ IDs         │ IDs         │ IDs            │ … │
┝━━━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━━━━━┿━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━┥
│           1 │ 0      │ 0     │ ~1 million │ ~1 billion  │ ~1 trillion │ ~1 quadrillion │ … │
│          32 │ 5      │ 1     │ 32768      │ ~32 million │ ~32 billion │ ~32 trillion   │ … │
│         256 │ 8      │ 1.6   │ 4096       │ ~4 million  │ ~4 billion  │ ~4 trillion    │ … │
│        1024 │ 10     │ 2     │ 1024       │ ~1 million  │ ~1 billion  │ ~1 trillion    │ … │
│       65536 │ 16     │ 3.2   │ 16         │ 16368       │ ~16 million │ ~16 billion    │ … │
│  4294967296 │ 32     │ 6.4   │ 0          │ 0           │ 256         │ 261888         │ … │
│ 68719476735 │ 35.99… │ 7.2   │ 0          │ 0           │ 16          │ 16368          │ … │
└─────────────┴────────┴───────┴────────────┴─────────────┴─────────────┴────────────────┴───┘

Values above 8 bits are not particularly likely to be useful for type discrimination,
but could be for other forms of discrimination (e.g. shared tenantry),
or for pure sparseness (especially for authenticatory IDs).

The discriminant should generally be smaller than sparsity;
otherwise the ID reuse problem will occur,
and type-unaware decoding will no longer be able to retrieve the discriminant in full.
But this is not checked, partly for reasons of efficiency,
and partly because an overly-large discriminant can be used legitimately,
as a simple offset to lengthen IDs (though sparsity and skipping ID 0 is probably generally a better tool).

I’ve introduced the concept of typed TESID coders to the libraries.
TESID offers no assistance in maintaining discriminant values for your types;
that you must maintain yourself.

It’s possible to achieve these effects in the database,
by adjusting the start (discriminant) and step (sparseness) of the sequence.
TESID provides this functionality anyway for three reasons:

(a) Most databases don’t provide the tools to readily adjust sequence stride;

(b) Your database is probably limited to 63 bits,
but TESID performs these operations in 100-bit space,
allowing sparseness and discrimination even on implausibly large ID values; and

(c) You may be applying TESID to an existing sequence.

I identify three main parameter value groups:

• Big sparsity, no or small discriminant: longer, much less guessable IDs.
Go far enough and this can be reasonable for authentication:
at 2³⁰, you’re adding six characters to the string length,
and only one in a billion IDs will be valid.
And you can go further still if you really want.

• Big discriminant, no sparsity: just longer IDs, but fully guessable once you have enough.
If you go big enough, this can give the perfect illusion of a fixed-length string:
with a u64 ID, you can safely use 2⁶⁰ as a discriminant and get always-14-character TESIDs;
these will *in practice* be quite suitable for use as authentication.
But you have to be careful in choosing the big discriminant to use,
since the strings *are* still variable-width,
and full subranges are perfectly guessable.
All up, I don’t generally recommend this form.

• Small sparsity, type discriminant: slightly longer and slightly less guessable IDs.
This is a good choice for most purposes.

(When I say “small sparsity”, I suggest values like 32, 256 or 1024.
You don’t *have* to line up with bits, but I mildly prefer to.)

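The always-14-characters claim in the second group can be checked with a small length helper (mine, not the library’s): the string length is the smallest supported even number of characters, minimum four, whose five-bits-per-character capacity fits the value.

```rust
/// String length for a value: the smallest even number of characters
/// (minimum four), at five bits per character.
fn tesid_len(value: u128) -> u32 {
    let mut len = 4;
    while value >> (5 * len) != 0 {
        len += 2;
    }
    len
}

fn main() {
    let d = 1u128 << 60; // the suggested big discriminant
    // With no sparsity, a u64 id becomes id + 2⁶⁰, and every such value
    // needs more than 60 bits but fewer than 70, hence 14 characters:
    assert_eq!(tesid_len(d), 14); // id 0
    assert_eq!(tesid_len(u64::MAX as u128 + d), 14); // largest u64 id
    println!("always 14 characters");
}
```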
• Auto-incrementing database ID. This is the obvious and most likely use.

• Time. If you use a unix timestamp in seconds, you will get 8-character IDs
until the year 36812, which is hopefully enough for you.
(Times before 2004-01-11T00:37:04 will be six characters,
and from 1970-01-01T00:00:00 to 1970-01-13T13:16:15 four.)
If you use YYYYMMDDHHMMSS, you’ll get 10-character IDs;
subtract 2000 from the year and go with YYMMDDHHMMSS,
and you’re back to 8 characters until the end of 2109.

Practical example that I am thinking of using:
private but sharable (that is, the URL is the only authentication) pages on my website
where I add a random token to the slug using the publication date.