Skip to main content

Internet-Draft submission

Submission status: Cancelled

This submission has been cancelled, modification is no longer possible.

Submission checks

Your Internet-Draft has been verified to pass the submission checks.

Meta-data from the submission

Note: The meta-data shown for a cancelled draft may be incorrect or incomplete.
Document draft-ray-radx
Revision 00
Group Individual Submission
Document date 2018-02-17
Submission date 2018-02-16
Title Low Overhead Compression For Short String Data
Author count 0 authors
Abstract Compression methods primarily rely on detecting patterns of redundancy
in text strings, which means that a minimum of around 800 bytes of data
are required before compression can occur. Most data fields in data-
bases are far smaller than that, and therefore dictionary-based compres-
sion is not efficient. The method proposed begins by making assumptions
about the data and accepting some compromises; which allows compression
to begin from the very first byte. The discussion proves that this can
result in 1) an average 30% compression, 2) a maximum 50% compression,
and 3) no expansion at all (0%) of the input data. As an added benefit,
all processing can be performed using bit shifting and masking, which
drastically lowers the cost of compression; no division or multiplica-
tion operations are necessary. UTF-8 text is handled gracefully.

OVERVIEW AND BACKGROUND

The origin of RadX is the attempt by Digital Equipment Corporation in
the 1970's to pack more text into the limited memory space available to
computers in that era, particularly as DEC was supporting different
classes of machines with different word sizes - some, like the PDP/8
series, were byte machines, but some used 32 or even 36 bits.

DEC engineers basically used radix40 as the basis for a compressed
character set which included the alphabet, the numerals, the space, and
three other special characters (which varied by machine). They called it
"RAD50", because in that era DEC were into octal arithmetic and every-
thing was in three-bit groups. It worked - they packed three letters
into every 16 bits (40 x 40 x 40 = 64000), but the glaring problem was
that it didn't work for anything but filenames - the character set was
too small. There were also performance issues. Using non-binary bases
for character sets is not kind to ALUs; lots of division, modulus and
multiplication are involved.

Those who designed UTF-8 to represent Unicode in a flexible and fool-
proof way had a similar problem. They decided on an approach in which
all transformations could use bit-shifting instead of divide and multi-
ply. We have done the same, which implies, of course, sticking to binary
bases and radices.

DISCUSSION

The best compression methods seek out redundancy in text - but in order
to do that, the analyzer needs a sufficiently large sample, which some
experts have determined needs to exceed 800 bytes for random text. The
most elegant solutions generate dictionaries on-the-fly, requiring only
one pass through the data. If this method does not serve, then the best
alternative is to make some assumptions about the data before it is ever
seen, perhaps even restrict the universe of permitted values, and then
dive right into a compression procedure from the very first byte.

UTF-8 solved many encoding problems - backward compatibility with ASCII
being the most important, but no less important is the ability to
recover from a bad byte or sequence error, to skip illegal segments
which may have come from some other Unicode encoding scheme, and so on.
But the recognition that an algorithm cannot be arrogantly wasteful of
CPU cycles is what has made it possible to utilize it without having to
worry about the cost.

IMPLEMENTATION

RadX is a useful combination of Radix-32/64 compression. The data is
primarily stored in Radix-32, which allows bit-shifting instead of divi-
ding/multiplying to greatly simplify computation. This is combined with
a check for double characters, which can be encoded effectively in 4
bits. Other ASCII characters are stored in 6 bits; UTF-8 characters are
reduced to bare Unicode and stored in 24 bits maximum. A series of zero
bytes is used for padding.

As the input string is scanned from left to right, all disallowed Uni-
code ranges are ignored. Control characters are treated as spaces, and
all spaces consolidated. If the character is UTF-8, the Unicode page
number is simply substituted for pages 00 to 0F. For codes from U+0080
to U+0FFF, only the last 8 bits need to be added, so an extra byte is
used. For U+1000 onwards, two extra bytes are encoded. For safety,
certain bit patterns are detected and skipped.

If the character is not UTF-8, then it is simple ASCII - and Radix-32
compression is attempted. A special subset of 32 characters is defined
as:

" .0123456789ABCDEFGHILMNOPRSTUWY"
These are space, period, the numerals, and the alphabet with the excep-
tion of J, K, Q, V, X, and Z. The table indexing automatically upper-
cases alphas and substitutes some punctuation. If doubled codes are
found, Radix-64 encoding is used with a "double" flag; if three Radix-
32 encoding are found, that set is used; else a single Radix-64 byte is
created. The bit allocation table below shows the final encodings.

Bit allocation
==============
0 n xxxxxx Radix-64 ASCII, n=0 if doubled, 1 if single
0 000 0000 +2 bytes Unicode U+1000 to U+D7FF (or padding)
0 011 1xxx +1 byte Unicode U+0080 to U+07FF
0 111 xxxx +1 byte Unicode U+0800 to U+0FFF
0 111 1110 +1 byte Unicode repeats, diff. in last digit
0 111 1111 Identical repeat of entire last sequence
1 xxxxx yyyyy zzzzz Radix-32 triplet

The great advantages of this system over DEC RAD50 are that 1) the
computational overhead is low, since there is no need to use division
and modulus repeatedly for each character - simple bit-shift is suffi-
cient, and 2) there are no unmappable characters. The Internet
requires UTF-8 support, and with RadX we can have our cake and eat it
too; we can compress ASCII efficiently and yet not get fazed when UTF-8
unexpectedly pops up (or is introduced on purpose).

In fact, quite a lot of UTF-8 is also compressed - Indic languages, for
instance, such as Hindi and Tamil are also compressed by 30%. The
following ranges are mapped: U+0080 to U+024F, and U+0300 to U+D7FF.
Codes outside those ranges are deleted. Codes from U+0080 to U+0FFF
are mapped using one extra byte; from U+1000 onwards, two extra bytes
are required. This still represents a savings, or at least no penalty:

A. A UTF-8 character is found which is near or identical to the
previous one. It is now stored in 8 bits.
Compression = 3:1 to 2:1 (66% to 50%)
B. Double ASCII character pair found, map to Radix-64. 2 bytes are
stored in 8 bits. Compression = 2:1 (50%)
C. 3 characters found which map into Radix-32. 3 bytes are stored
in 16 bits. Compression = 3:2 (33%)
D. A UTF-8 character in the range U+0800 to U+0FFF is found; it
occupies 3 byte in UTF-8. It is now stored in 16 bits.
Compression = 3:2 (33%)
E. An ASCII character is found, and mapped to Radix-64. 1 byte is
stored in 8 bits. Compression = 1:1 (0%)
F. A UTF-8 character is found in the range U+0080 to U+07FF; it
occupies 2 bytes and is stored in 16 bits. Compression = 2:2 (0%)
G. A UTF-8 character in the range U+1000 to U+D7FF occupies 3 bytes
and is encoded in 24 bits. Compression = 3:3 (0%)

That takes care of all the possible characters in ASCII and UTF-8. Any
dangerous Unicode sequences are deleted as shown below. All the
encodable sequences are first checked to see if they match the previous
Unicode character; if they do, the single-byte Unicode repeat code is
used.

Bit Transformation Sequence (in processing order)
=================================================
0 ??????? -> indexed substitution table
zzzzzz -> 00 zzzzzz [Note A]
xxxxx yyyyy zzzzz -> 1 xxxxx yy yyy zzzzz [Note B]
zzzzzz -> 01 zzzzzz [Note C]
10 ?????? -> delete [orphan continuation byte]
1100 000 -> delete [Note D]
1100 0 x yy 10 yy zzzz -> 0011 100x yyyy zzzz [Note E]
1100 1000 10 yy zzzz -> 0011 1010 00 yy zzzz [Note F]
1100 1001 10 00 zzzz -> 0011 1010 0100 zzzz [Note G]
1100 10 ?? -> delete [Note H]
110 xxx yy 10 yy zzzz -> 0011 1xxx yyyy zzzz [Note J]
1110 0000 10 0????? -> delete [illegal padding]
1110 0000 10 xxxx yy 10 yy zzzz -> 0111 xxxx yyyy zzzz [Note K]
1110 111 ? -> delete [U+E000 - U+FFFF]
1110 aaaa 10 xxxx yy 10 yy zzzz -> 0000 0000 aaaa xxxx yyyy zzzz [No.L]
???? ???? -> delete [U+10000 and above]

Notes: [A] ASCII doubled char in Radix-64
[B] 3 ASCII chars in Radix-32
[C] ASCII single char in Radix-64
[D] U+007F or less - hidden ASCII
[E] U+0080 to U+01FF
[F] U+0200 to U+023F
[G] U+0240 to U+024F
[H] U+0250 - U+02FF - custom chars
[J] U+0300 - U+07FF
[K] U+0800 - U+0FFF
[L] U+1000 to U+DFFF

The substitution table below allows the alphabetic characters in Radix-
64 to be visually recognized as such in data files, so that there is
some ability to verify directly. The Radix-32 set is mostly aligned
with it to simplify debugging.

In Radix-64, only codes 1 to 55 are used for ASCII; 0 and 56 to 63 are
reserved for Unicode encoding prefixes.

Radix-64 and Radix-32 Character Encoding Table
==============================================
Char Rad32 Rad64 Char Rad32 Rad64 Char Rad32 Rad64
==== ===== ===== ==== ===== ===== ==== ===== =====
<space> 0 32 @ 55 ` 39
! 10 33 A 1 1 a 1 1
" 34 B 2 2 b 2 2
# 35 C 3 3 c 3 3
$ 36 D 4 4 d 4 4
% 37 E 5 5 e 5 5
& 38 F 6 6 f 6 6
' 39 G 7 7 g 7 7
( 27 H 8 8 h 8 8
) 29 I 9 9 i 9 9
* 40 J 10 j 10
+ 41 K 11 k 11
, 42 L 12 12 l 12 12
- 43 M 13 13 m 13 13
. 10 10 N 14 14 n 14 14
/ 28 O 15 15 o 15 15
0 11 44 P 16 16 p 16 16
1 17 45 Q 17 q 17
2 22 46 R 18 18 r 18 18
3 24 47 S 19 19 s 19 19
4 26 48 T 20 20 t 20 20
5 27 49 U 21 21 u 21 21
6 28 50 V 22 v 22
7 29 51 W 23 23 w 23 23
8 30 52 X 24 x 24
9 31 53 Y 25 25 y 25 25
: 54 Z 26 z 26
; 54 [ 27 { 27
< 27 \ 28 | 28
= 43 ] 29 } 29
> 29 ^ 30 ~ 30
? 10 33 _ 31

The reverse process is even simpler. A series of zeroes flags the end
of the text, so processing can stop once that is encountered. Every
other possible combination represents a compression mode that can be
decoded and expanded.

RadX To UTF-8 Regeneration Sequence (in processing order)
=========================================================
0000 0000 0000 ???? ???? ???? -> delete [space padding]
0000 0000 aaaa xxxx yyyy zzzz -> 1110 aaaa 10 xxxx yy 10 yy zzzz
0011 0xxx yyyy zzzz -> 110 xxx yy 10 yy zzzz
0011 1110 zzzz ZZZZ -> rpt prev UTF-8, subst last digit
0011 1111 -> repeat previous sequence
0011 1xxx yyyy zzzz -> 1110 0000 10 1xxx yy 10 yy zzzz
00 zzzzzz -> rrrrrrrr rrrrrrrr [reverse Radix-64]
01 zzzzzz -> rrrrrrrr [reverse Radix-64]
1 xxxxx yy yyy zzzzz -> rrrrrrrr rrrrrrrr rrrrrrrr
[reverse Radix-32]

The fourth sequence above is used when the current UTF-8 and also the
next one, are almost the same as the last but differ in the final hex
digit. The fifth sequence above is used when the entire current
encoding is identical to the last.

Special processing is done for word beginnings which are lower case -
the preceding Radix-64 space character's "double" bit is set to
indicate that the next letter should NOT be uppercased (0x20 vs. 0x60).
(Double spaces are impossible, since they are eliminated before
anything else.)

The sequence 0x800? cannot normally occur, since that would represent
the Radix-32 sequence for two spaces in a row; so this allows for a set
of 32 special codes which have not been allocated, any may be used for
future extension.

USAGE EXAMPLE

And finally, here is a worst-case scenario with the famous full-alpha-
bet sentence.

T H E Q U I C K B R O W N
..20+8+5... 32 17 ..21+9+3... 11 ..0+2+18... ..15+23+14..

F O X J U M P S O V E R
..0+6+15... 24 32 10 ..21+13+16.. ..19+0+15.. 22 ..5+18+0

T H E L A Z Y D O G .
..20+8+5... ...0+12+1... 26 ..25+0+4... ..15+7+10..

The three-number groups represent a two-byte condensation in Rad-32,
and the lone numbers represent a fallback to Rad-64 using one byte.
The original 44-byte sentence is reduced to 32, or a 27% reduction even
without any doubled characters.

This letter frequency assumption works for all European languages, with
a slight loss for German, Italian and Polish.

RESULTING FEATURES

The result of applying the above procedures to text strings gives the
following advantages:
Page count 6
File size 16.6 KB
Formal languages used C Code
Submission additional resources None

Submitter information

Name Ray de Silva
Email address raydesilva@hotmail.com

History

Date By Event
2018-02-23 Cancelled submission
2018-02-19 Sent full access URL to: Ray de Silva <raydesilva@hotmail.com>
2018-02-18 Sent full access URL to: Ray de Silva <raydesilva@hotmail.com>
2018-02-16 Edited rev, formal languages, abstract, submitter and sent request for manual post
2018-02-16 Uploaded submission
Please send reports about submission tool bugs to the Tools Team using one of the Bug Report links at the bottom of the page.
If you run into problems submitting an Internet-Draft and need to request manual posting of an Internet-Draft, please send the Internet-Draft and the reason for manual posting to support@ietf.org. Be advised that manual processing always takes additional time.