1 These are miscellaneous tests of my own division.
2 _
3 hello_world
4 The quick brown fox jumps over the lazy doggerel.txt
5 well ? then _ . __ how about this ?
6 Once upon a time there was a file name sanitiser; it was a good file name sanitiser, and never exposed security vulnerabilities to the World. “It’s a dangerous place,” its grand. “If a wolf should come out of the forest, then what would you do?”
7 (Some Peter and the Wolf snuck in there.)
8 .hidden
9 C:\WINDOWS\system32\driver\etc\hosts
10 %WINDIR%\system32\driver\etc\hosts
11 Kinda funny how Windows has a /etc/hosts.
12 .
13 ..
14 ...
15 ....
16 /././././././
17 . .. . . . .. ..
18 I’m basically just typing random stuff here.
19 OK, time for some more serious stuff.
20 _
21 For Unicode paths, some file systems limit paths to roughly 255 UTF-8 code units, others to roughly 255 UTF-16 code units. UTF-8 is the tighter of these restrictions in all circumstances: UTF-16 uses one code unit until U+F. Now then: one-byte characters:
22 # One-byte characters:
23 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345
24 12345678901234567890.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
25 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678.abcdefghijklmnopqrstuvwxyz
26 # Two-byte characters:
27 áɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠ.
28 áɓç.°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³
29 áɓçđéƒɠ.°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³
30 áɓçđéƒɠɦïķá.°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³°¹²³
31 áɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓçđéƒɠɦïķáɓç.°¹²³
32 # Three-byte characters:
33 ‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‐‑‒–—―‖‗‘’‚‛“
34 ‐‑‒–—.₁₂₃₄₅₆₇₈₉₀₁₂₃₄₅₆₇₈₉₀₁₂₃₄₅₆₇₈₉₀₁₂₃₄₅₆₇₈₉₀₁₂₃₄₅₆₇₈₉₀₁₂₃₄₅₆₇₈₉₀₁₂₃₄₅₆₇₈₉₀₁₂₃₄₅₆₇₈₉
35 ‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…‧‐‑.₁₂₃₄₅₆₇₈₉₀
36 # Four-byte characters:
37 𐀀𐀁𐀂𐀃𐀄𐀅𐀆𐀇𐀈𐀉𐀊𐀋𐀍𐀎𐀏𐀐𐀑𐀒𐀓𐀔𐀕𐀖𐀗𐀘𐀙𐀚𐀛𐀜𐀝𐀞𐀟𐀠𐀡𐀢𐀣𐀤𐀥𐀦𐀨𐀩𐀪𐀫𐀬𐀭𐀮𐀯𐀰𐀱𐀲𐀳𐀴𐀵𐀶𐀷𐀸𐀹𐀺𐀼𐀽𐀿𐁀𐁁𐁂.
38 𐀀𐀁𐀂𐀃𐀄𐀅𐀆.𐂀𐂁𐂂𐂃𐂄𐂅𐂆𐂇𐂈𐂉𐂊𐂋𐂌𐂍𐂀𐂁𐂂𐂃𐂄𐂅𐂆𐂇𐂈𐂉𐂊𐂋𐂌𐂍𐂀𐂁𐂂𐂃𐂄𐂅𐂆𐂇𐂈𐂉𐂊𐂋𐂌𐂍𐂀𐂁𐂂𐂃𐂄𐂅𐂆𐂇𐂈𐂉𐂊𐂋𐂌𐂍
39 𐀀𐀁𐀂𐀃𐀄𐀅𐀆𐀇𐀈𐀉𐀊𐀋𐀍𐀎𐀏𐀐𐀑𐀒𐀓𐀔𐀕𐀖𐀗𐀘𐀙𐀚𐀛𐀜𐀝𐀞𐀟𐀠𐀡𐀢𐀣𐀤𐀥𐀦𐀨𐀩𐀪𐀫𐀬𐀭𐀮𐀯𐀰𐀱𐀲.𐂀𐂁𐂂𐂃𐂄𐂅𐂆𐂇𐂈𐂉𐂊𐂋𐂌𐂍
40 _
41 abcdef.ghij
42 abcde.fghij
43 AUX.abcdef
44 lpT7.abcdef
45 cOm6.abcdef
46 CON
47 aux.h
48 Lpt1.exe
49 xyz
50 nül
51 COM1.jpg.png
52 _
53 Some sanitisers try stripping out ZWSP (​), which can be used as a fingerprinting vector and has no particularly legitimate purpose in a file name; I’m not, because removing it doesn’t solve the fingerprinting risk, as you can use ZWNJ and ZWJ (.)