Network Working Group                                            A. Main
Internet-Draft: draft-main-magic-00                        Black Ops Ltd
Category: Best Current Practice                             October 2001
Expires: April 2002


                   Care and Feeding of Magic Numbers

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Abstract

   This memo describes techniques for the use of magic numbers in a
   multimedia context, for the in-band identification of digital file
   formats.  Specific recommendations are made concerning the use of
   magic numbers in newly developed file formats.

1 Introduction

   There have historically been four main ways to determine the format
   of a digital data object.  In decreasing order of desirability, they
   are:

   1. Explicit indication in metadata accompanying the object.  (E.g.,
      the "Content-Type" header in a MIME message indicates the format
      of the body [MIME-MSG].)

   2. From context, i.e., the way in which an object is being used.
      (E.g., passing a file to the `gunzip' program indicates that the
      file should be in `gzip' format.)



Main                       expires April 2002                   [Page 1]


Internet-Draft      Care and Feeding of Magic Numbers       October 2001


   3. Inference from examination of the data: different data formats
      look different.

   4. Implicit indication from the name under which a file is stored, in
      contexts where it is conventional to name files in a way that
      indicates their format.

   In operating systems that do not keep type metadata with a file,
   method 1 is not usually possible.  For example, in Unix all files are
   typeless octet strings.  In such operating systems, the collective
   wisdom has been to use a combination of methods 2 and 3 to support
   each other.

   More generally, out-of-band identification mechanisms (1, 2, and 4)
   are often not possible, not least because metadata tends to become
   detached from primary data.  In-band identification (method 3) is the
   only file format identification mechanism that it is always possible
   to attempt.

   Because many non-textual file formats include some kind of fixed-
   format header, method 3 usually consists of examination of the
   beginning of the object to see what its header looks like.  A
   convention has arisen of aiding this type of format identification by
   including in file formats header fields whose primary purpose is to
   assist in identifying the file format.  These are known as `magic
   numbers'.

   Although the MIME system uses explicit type indication throughout,
   those developing MIME recognised the utility of other means of
   recognising file formats.  [MIME-REG] section 2.2.9 encourages MIME
   media type registration documentation to include details of magic
   numbers and file naming conventions, among other optional data.
   Experience has shown the wisdom of this recommendation: it is not
   uncommon that, once a digital object has left the control of
   metadata-preserving MIME-based Internet protocols, its attendant type
   information is discarded in one way or another.  There is also a
   problem in many cases when a file enters the realm of MIME-based
   protocols, of attaching the correct MIME type metadata.

   With the recent general increase in the popularity of multimedia
   applications, and the corresponding proliferation of new media types,
   magic numbers are becoming more widely useful than ever.  It
   therefore seems prudent to offer to the Internet community at large
   the common wisdom among Unix software engineers concerning magic
   numbers.

1.1 Requirements Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT",
   "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
   interpreted as described in [REQ-TERM].

1.2 Numerical Notation

   In this document, all numbers are given in decimal, except where
   otherwise indicated.  A prefix "0x" indicates hexadecimal.

2 Use of Magic Numbers

   There are two basic purposes for which a magic number can be used: to
   guess the format of a file where it is not previously known, and to
   confirm a file format that was specified by other means.  Other modes
   of use are combinations or extensions of these two.

2.1 Confirmation of File Format

   In the context of file format confirmation, a magic number isn't
   magical at all: it's simply a header field with fixed contents.  Any
   file without the magic number trivially fails a header validity
   check.  A magic number test used to confirm the file format is thus
   merely a partial file validity check, performed on a header field
   intended specifically for the purpose.

   By its nature, a magic number test is not a complete test of the
   validity of the data in a file.  Even a complete syntactic check of a
   file cannot conclusively confirm that the file was originally
   intended to be interpreted in the file format that it is presently
   purported to be in.

   Therefore, strictly speaking, a magic number cannot positively
   confirm identification of a file format; it can only with certainty
   negate an identification.  The usefulness of a magic number for the
   purpose of file format confirmation comes from its ability to provide
   a high degree of confidence in the format identification.  The degree
   of confidence in a confirmation is directly correlated with the
   degree of probability that the magic number will not be found in a
   file of any other format.  This confidence can therefore be increased
   by carefully engineering the magic number to increase the probability
   of correctly detecting a file format misidentification.  See
   section 3 for more information.
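
   A minimal sketch of such a confirmation test, in Perl, follows.  The
   helper name has_magic is hypothetical, and the PNG signature [PNG]
   used in the usage note is only an example value.  A false result
   rules the format out; a true result merely raises confidence.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the file at $path begins with exactly
# the octets in $magic.  A file shorter than the magic is a mismatch.
sub has_magic {
    my ($path, $magic) = @_;
    open(my $fh, '<:raw', $path)
        or die "$0: can't open $path: $!\n";
    my $got = read($fh, my $head, length($magic));
    defined($got) or die "$0: can't read $path: $!\n";
    close($fh);
    return $got == length($magic) && $head eq $magic;
}
```

   For instance, has_magic($path, "\x89PNG\x0d\x0a\x1a\x0a") would be a
   reasonable first step before handing a purported PNG file to a full
   parser, which remains responsible for validating the rest of the
   file.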

2.2 Guessing File Format

   Magic numbers aren't magic: they can't generate a file format
   identification ex nihilo.  All they can really do is confirm an
   existing guess about a file format, and even that is only
   probabilistic, as described in the previous section.  Therefore, the
   process for guessing a file format using magic numbers consists of
   testing the file against a series of possible file formats' magic
   numbers to see which it matches.

   It is beneficial, in any context where correct file type
   identification is important, to minimise the number of file formats
   considered: the confidence in a magic-number-based file type
   identification depends on the chance of a file of unknown type having
   none of the magic numbers considered, and this chance decreases as
   the number of magic numbers used increases.  The range of file
   formats to test against is inherently a context-dependent choice:
   most contexts will have a small number of meaningful file formats to
   be considered.

   For example, Unix operating systems have traditionally used magic
   numbers in the headers of their executable program file formats.
   When asked to execute a file, the operating system tests the
   purported executable against the various executable format magic
   numbers it knows about.  When a match is found, this provides some
   assurance that the purported executable is indeed an executable
   appropriate for this system, as well as determining which type of
   executable file it is and therefore how to go about executing it.
   Incidentally, as a result of this procedure, the error message for an
   improperly-formatted executable file has on some versions of Unix
   been "Bad magic".

   There is a problem with the basic technique for guessing file formats
   based on magic numbers: it is possible for a single file to match the
   magic number requirements of more than one file format.  In such a
   case, magic numbers would give no further insight into the file type.
   Among a set of cooperating file formats it is possible to completely
   avoid this problem by making their magic number requirements mutually
   incompatible; this is trivially achieved by giving them different
   magic number values to be stored at the same location in the file.
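
   The guessing procedure can be sketched as follows.  The candidate
   table and the name guess_format are illustrative only; the three
   magic values shown are those of PNG, GIF (version 89a), and gzip.
   All are tested at offset zero and none is a prefix of another, so at
   most one of them can match a given file.

```perl
use strict;
use warnings;

# Illustrative candidate set for one particular context.
my %MAGIC = (
    'png'  => "\x89PNG\x0d\x0a\x1a\x0a",
    'gif'  => "GIF89a",
    'gzip' => "\x1f\x8b",
);

# Return the names of all candidate formats whose magic number appears
# at offset zero of $head (the first octets of the file).
sub guess_format {
    my ($head, %candidates) = @_;
    return grep {
        substr($head, 0, length $candidates{$_}) eq $candidates{$_}
    } sort keys %candidates;
}

my @guess = guess_format("\x1f\x8b\x08\x00", %MAGIC);
print @guess == 1 ? "guess: $guess[0]\n" : "no unambiguous guess\n";
```

   An empty result means only that the file is in none of the formats
   considered.  More than one result would indicate that the candidate
   set is not mutually incompatible in the sense described above.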

2.3 Detection of Corruption

   Magic numbers can be used to detect certain types of mangling of the
   data in a file, giving early indication that the data of interest in
   the file is not intact.  This is really a special application of the
   technique of confirming an expected file format.  Although file
   corruption due to transmission errors is now almost entirely a thing
   of the past, there are still some types of corruption that can occur
   due to mistakes, and that magic numbers can help to detect.

   Transmission of binary file formats through paths intended for text
   is as much a problem as it ever was.  In addition to some octet
   values that just aren't handled by text gateways, and the
   historically-known problems of text being reformatted en route, this
   kind of misconfiguration can subject a file to unwanted character set
   conversions and newline format conversions.  A magic number,
   particularly if it contains non-ASCII octet values, is likely to be
   damaged by such conversions.

   Magic numbers can also help to detect endianness errors.  If a magic
   number is read as a numeric field, and the reader is interpreting
   numeric fields using a different endianness from that with which the
   file was written, then the magic number will appear to be incorrect,
   thus avoiding a potential silent misinterpretation of the rest of the
   file.
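
   The effect can be illustrated with a short sketch.  The 32-bit value
   0xd31bfe41 is a hypothetical magic number chosen for this example,
   not one belonging to any real format.

```perl
use strict;
use warnings;

# Hypothetical 32-bit magic number, written by a big-endian producer.
my $MAGIC  = 0xd31bfe41;
my $header = pack('N', $MAGIC);             # octets d3 1b fe 41

# A reader using the right and the wrong byte order respectively.
my $as_big    = unpack('N', $header);
my $as_little = unpack('V', $header);

for my $read ($as_big, $as_little) {
    printf "read 0x%08x: %s\n", $read,
        $read == $MAGIC ? 'magic ok' : 'bad magic';
}
```

   The little-endian reader sees 0x41fe1bd3 and can reject the file
   before misinterpreting any other numeric field.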

   The magic number for the PNG image format [PNG] takes this usage of
   magic numbers to an extreme.  4 of the first 8 octets of the PNG file
   format are intended, at least in part, to detect text-related
   manglings.
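
   As a sketch of how this works, the following applies three typical
   text-path manglings to the 8-octet PNG signature (0x89 0x50 0x4e
   0x47 0x0d 0x0a 0x1a 0x0a).  The particular set of manglings is an
   illustrative assumption, not an exhaustive list.

```perl
use strict;
use warnings;

my $PNG_SIG = "\x89PNG\x0d\x0a\x1a\x0a";

# Manglings typical of a transfer path intended for text.
my %mangle = (
    'strip high bit' => sub {
        my $s = shift; $s =~ tr/\x80-\xff/\x00-\x7f/; return $s;
    },
    'CRLF to LF' => sub {
        my $s = shift; $s =~ s/\x0d\x0a/\x0a/g; return $s;
    },
    'LF to CRLF' => sub {
        my $s = shift; $s =~ s/\x0a/\x0d\x0a/g; return $s;
    },
);

for my $name (sort keys %mangle) {
    my $out = $mangle{$name}->($PNG_SIG);
    printf "%-14s %s\n", $name,
        $out eq $PNG_SIG ? 'undetected' : 'detected';
}
```

   Each of the three manglings alters the signature, so a reader that
   compares the first 8 octets exactly will notice all of them.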

3 Putting Magic Numbers into New File Formats

   As should already be apparent, it is useful in several situations to
   have some chance of successful in-band file format identification.
   To this end, each new file format where it may be useful SHOULD have
   some kind of magic number.

   Magic numbers are useful in digital data objects, including
   particularly media objects, that are expected to be visible in more
   than one context.  Formats for objects with more specialised use,
   such as the packets of a networked protocol, have less need for magic
   numbers.  However, the same magic number techniques can still be
   reasonably used in such cases if there is no conflicting requirement,
   for example to make the packet as small as possible.

   A basic requirement for the usefulness of magic numbers is that
   different file formats with magic numbers MUST have different magic
   numbers.  New magic number values SHOULD be completely unrelated to
   pre-existing magic numbers.

   It is common for the magic numbers of related file formats to be
   chosen to be similar, for example by having adjacent numeric values.
   Doing this reduces the effectiveness of the magic numbers, by making
   it more likely that arbitrary data of some other type will match one
   of the range of magic numbers.  The chance of accidental collision of
   magic number with magic number or magic number with real data is
   minimised by having magic numbers for different file types be
   completely unrelated, and this is therefore RECOMMENDED.

3.1 Magic Numbers in Binary File Formats

   This section applies to file formats where the underlying format
   consists of a string of bits, which for convenience we divide into
   octets.

3.1.1 Recommended Placement

   It is desirable that as many file formats as possible should be
   mutually incompatible.  This is achieved by them having different
   magic numbers at the same location within the file.  By far the most
   common location for a magic number within a file is the very
   beginning, offset zero.  This is also the most logical location for
   it, and the easiest to read from.

   Therefore, any new binary file format SHOULD place its magic number
   at the very beginning of the file, offset zero.

3.1.2 Recommended Length

   Historically, many magic numbers have been very small, often only 2
   octets.  At the time of writing, the most popular size is 4 octets.
   Both of these sizes are rather small.  There is almost never a real
   need to save a few octets in a file header; mass storage is orders of
   magnitude cheaper than it was when 2-octet magic numbers were
   popular.  It seems more worthwhile to spend a few more octets to
   minimise the likelihood of accidental magic number collision.

   Therefore, considering the increasing popularity of 64-bit computing
   hardware, new magic numbers SHOULD be 8 octets (64 bits) in length.
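
   As a rough illustration of why length matters, the sketch below
   computes the chance that arbitrary data matches one of a set of
   magic numbers.  It assumes that the data examined is uniformly
   random, which is an idealisation, and the figure of 100 candidate
   formats is arbitrary.

```perl
use strict;
use warnings;

# Probability that $octets uniformly random octets equal one of
# $formats distinct magic numbers of that length, all tested at the
# same offset.
sub false_match_probability {
    my ($octets, $formats) = @_;
    return $formats / 256 ** $octets;
}

for my $octets (2, 4, 8) {
    printf "%d octets, 100 formats: %.3g\n",
        $octets, false_match_probability($octets, 100);
}
```

   Each additional octet divides the chance of an accidental match by
   256, which is a large gain for a very small cost in file size.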

   Note: being a file compression format is no excuse for skimping on
   the magic number!  If an object being compressed is so small that a
   few extra octets of magic number are really significant, then
   compression overheads will probably render the compression useless
   anyway.

3.1.3 Nature of the Magic Number

   Any usable file format specification should specify the layout of a
   file right down to the octet level, and the magic number field is no
   exception.  It is not sufficient to merely specify a 64-bit (or
   however large) number and state that it is stored at a particular
   offset within the file; it is necessary to specify exactly what the
   octet values are in the magic number field.  Of course, if a file
   format specification first establishes a convention for the
   representation of numerical fields (big endian, little endian, or
   anything else), then simply specifying a large number to place in the
   magic number field will be sufficiently unambiguous.

   To summarise: file format specifications MUST specify the contents of
   the magic number field sufficiently clearly to determine the exact
   sequence of octets that fill that field.  This specification SHOULD
   be in the form of an explicit list of octet values.

3.1.4 Selecting a Magic Number

3.1.4.1 Requirements

   The basic requirement on a magic number is that it look different
   from as many other file formats as possible.  This can be divided
   into two requirements: it should be different from all other magic
   numbers, and it should look different from non-magic-number data
   formats (principally text formats).

   There is a popular but misguided technique of selecting meaningful
   ASCII character values to make up a magic number.  For example, a
   popular Unix archival file format uses the ASCII characters "!<arch>"
   as its magic number.  This kind of magic number is very poor, because
   by definition the magic number test can be satisfied by a plain ASCII
   text file.  In many cases, the sequence of characters chosen has been
   one particularly likely to occur naturally at the beginning of a text
   file.  There is, of course, no technical requirement for the first
   few octets of a binary-format file to contain text characters.
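
   The weakness is easy to demonstrate.  In this sketch the helper name
   could_start_text_file and the sample text are hypothetical; the
   magic value is the one quoted above.

```perl
use strict;
use warnings;

# True if every octet of $magic is printable ASCII or common text
# whitespace, so that a plain text file could begin with it.
sub could_start_text_file {
    my ($magic) = @_;
    return $magic =~ /\A[\x09\x0a\x0d\x20-\x7e]+\z/ ? 1 : 0;
}

my $ar_magic  = "!<arch>";
my $text_file = "!<arch> is an odd way to begin a note, but a legal one.\n";

print "magic is plain text\n" if could_start_text_file($ar_magic);
print "text file passes the magic number test\n"
    if substr($text_file, 0, length $ar_magic) eq $ar_magic;
```

   Both messages are printed: nothing about the magic number
   distinguishes an archive from a text file that happens to start
   with the same characters.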

3.1.4.2 Magic Numbers for Related File Formats

   Historically, some file formats have been deliberately ambiguous
   about octet ordering in numerical fields.  They have used the native
   ordering on whatever system the file was intended to be used on.
   Often the magic number field was handled the same way: it was a
   numeric field, written in the native numeric format, and so
   recognition of the magic number indicated implicitly that the reader
   was reading in the right numeric format.  Another view is that such
   file formats actually define two (or more) variant file formats,
   differing only in numeric format and in the contents of the magic
   number field.  This leads to the use of the magic number field to
   detect the numeric format that should be used to interpret the rest
   of the file.  Designing file formats like this is not recommended,
   but the accompanying magic number technique is good.

   Where a file format has variants that, apart from the magic number,
   differ only in the format of numeric fields, the contents of the
   magic number field MAY be varied in the same way, but in any case
   MUST vary in some way.
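
   A reader for such a pair of variants might be sketched as follows.
   The 4-octet magic value, the one-field file layout, and the name
   numeric_template are all hypothetical, chosen only to illustrate the
   technique of varying the magic number with the numeric format.

```perl
use strict;
use warnings;

# Hypothetical 4-octet magic number of the big-endian variant; the
# little-endian variant stores the same octets in reverse order.
my $MAGIC_BE = "\xd3\x1b\xfe\x41";
my $MAGIC_LE = scalar reverse $MAGIC_BE;

# Return the unpack template for 32-bit fields in this file ('N' for
# big-endian, 'V' for little-endian), or nothing if neither matches.
sub numeric_template {
    my ($head) = @_;
    my $magic = substr($head, 0, 4);
    return 'N' if $magic eq $MAGIC_BE;
    return 'V' if $magic eq $MAGIC_LE;
    return;
}

# A little-endian file: magic number, then one 32-bit field.
my $file     = $MAGIC_LE . pack('V', 1234);
my $template = numeric_template($file) or die "Bad magic\n";
my ($field)  = unpack("x4 $template", $file);
print "field = $field\n";
```

   The magic number is compared as an octet string rather than read as
   a number, so the reader learns the numeric format before it
   interprets any numeric field.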

3.1.4.3 Recommended Selection Criteria

   To give the best possible chance of a magic number being different
   from other magic numbers, and to look as little like other structured
   data formats as possible, magic numbers SHOULD be selected randomly.
   Randomness of cryptographic strength is not necessary, but the
   randomness source should be statistically unbiased.

   To avoid accidentally generating a magic number that happens to look
   like a textual file format or is in other ways weak, randomly
   selected magic numbers SHOULD be filtered according to the following
   criteria:

   o  There should be no adjacent identical octets.  Non-random data is
      relatively likely to have such patterns, and this requirement also
      ensures that the magic number can't possibly be unchanged if the
      file is improperly byte-swapped or similarly mangled.

   o  At least 50% of the octets should have the most significant bit
      set.  This ensures that the magic number cannot be mistaken for
      ASCII text, and is highly unlikely to look like text in any ASCII
      extension character set (such as ISO-8859-1), where most of the
      text tends to be in the ASCII range.  It also ensures that
      mangling that strips off the most significant bit of each octet
      will be detected.

   o  At least 75% of the octets should be outside the ASCII printable
      range.  This minimises the chance of clashing with an ASCII-
      compatible character set.

   o  There should be at least one octet in the ASCII printable range;
      at least one in the non-ASCII printable range of the ISO-8859
      character sets; and at least one that is a control character in
      the ISO-8859 character sets, other than 0x09, 0x0a, 0x0c, and 0x0d
      (which are the only control characters that commonly occur in
      plain text).

   o  The magic number should not be a valid substring of UTF-8.
      Fortunately UTF-8 is quite highly structured, by design, so it is
      easy to eliminate the possibility of a clash.

   o  The octet-reverse of the magic number should also meet all of the
      above criteria.  This is to support the dual octet ordering
      technique described in section 3.1.4.2.

   These filtering rules provide some 1.16*2^62 acceptable 8-octet magic
   numbers (approximately 29.0% of all 64-bit values), and 1.16*2^29
   acceptable 4-octet magic numbers (14.5% of all 32-bit values).
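
   The criteria can also be applied to an existing or proposed magic
   number.  The following is a sketch of such a checker; the name
   violations and the second sample value are hypothetical, and the
   UTF-8 pattern is the one used by the selection program in section
   3.1.4.4 (covering the original 1 to 6 octet form of UTF-8).

```perl
use strict;
use warnings;

# True if $octets could not occur inside a UTF-8 stream.  The padding
# lets a truncated final sequence complete.
sub not_utf8 {
    my ($octets) = @_;
    my $padded = $octets . ("\x80" x 5);
    return $padded !~ /\A[\x80-\xbf]{0,5}(?:
            [\x00-\x7f]
          | [\xc0-\xdf][\x80-\xbf]
          | [\xe0-\xef][\x80-\xbf]{2}
          | [\xf0-\xf7][\x80-\xbf]{3}
          | [\xf8-\xfb][\x80-\xbf]{4}
          | [\xfc-\xfd][\x80-\xbf]{5}
        )*\x80{0,5}\z/sx;
}

# Return the list of criteria from this section that $magic violates.
sub violations {
    my ($magic) = @_;
    my $len  = length $magic;
    my $high = ($magic =~ tr/\x80-\xff//);   # octets with high bit set
    my $asc  = ($magic =~ tr/\x20-\x7e//);   # ASCII printable octets
    my @bad;
    push @bad, 'adjacent identical octets'  if $magic =~ /(.)\1/s;
    push @bad, 'under 50% high-bit octets'  if $high * 2 < $len;
    push @bad, 'over 25% ASCII printable'   if $asc * 4 > $len;
    push @bad, 'no ASCII printable octet'   if $asc == 0;
    push @bad, 'no high ISO-8859 printable' if $magic !~ /[\xa0-\xff]/;
    push @bad, 'no unusual control octet'
        if $magic !~ /[\x00-\x08\x0b\x0e-\x1f\x7f-\x9f]/;
    push @bad, 'valid UTF-8 substring'      if !not_utf8($magic);
    push @bad, 'reverse is valid UTF-8'
        if !not_utf8(scalar reverse $magic);
    return @bad;
}

my %sample = (
    'PNG signature'      => "\x89PNG\x0d\x0a\x1a\x0a",
    'hypothetical value' => "\xd3\x1b\xfe\x41\x9c\xe7\x05\xb2",
);
for my $name (sort keys %sample) {
    my @bad = violations($sample{$name});
    print "$name: ",
        (@bad ? join('; ', @bad) : 'meets all criteria'), "\n";
}
```

   By this sketch's reckoning the PNG signature fails four of the
   criteria (too few high-bit octets, too many printable octets, no
   high-half printable octet, and validity as a UTF-8 substring),
   while the hypothetical value passes all of them.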

3.1.4.4 Magic Number Selection Program

   This Perl program can be used to generate high-quality magic numbers
   using the generation rules given in the previous section.

       #!/usr/bin/perl
       use strict;
       use warnings;

       # Length of the magic number in octets (default 8).
       my $length = $ARGV[0] || 8;
       $length >= 4 or die "$0: Magic must be at least 4 octets\n";
       open(my $random, '<:raw', '/dev/urandom')
           or die "$0: Can't open /dev/urandom: $!\n";

       # True if the octet string could not occur inside a UTF-8
       # stream.  The padding lets a truncated final sequence complete.
       sub not_utf8 {
           my ($octets) = @_;
           my $padded = $octets . ("\x80" x 5);
           return $padded !~ /\A[\x80-\xbf]{0,5}(?:
               [\x00-\x7f]|
               [\xc0-\xdf][\x80-\xbf]|
               [\xe0-\xef][\x80-\xbf]{2}|
               [\xf0-\xf7][\x80-\xbf]{3}|
               [\xf8-\xfb][\x80-\xbf]{4}|
               [\xfc-\xfd][\x80-\xbf]{5}
           )*\x80{0,5}\z/sx;
       }

       my $magic;
       while (1) {
           my $got = sysread($random, $magic, $length);
           defined($got) or die "$0: /dev/urandom: $!\n";
           $got == $length or die "$0: Short read\n";
           # no adjacent identical octets
           next if $magic =~ /(.)\1/s;
           # at least 50% of octets with the high bit set
           my $high = ($magic =~ tr/\x80-\xff//);
           next unless $high * 2 >= $length;
           # at least 75% not ASCII printable
           my $asc = ($magic =~ tr/\x20-\x7e//);
           next if $asc * 4 > $length;
           # at least one ASCII printable
           next unless $asc;
           # at least one high-half ISO-8859 printable
           next unless $magic =~ /[\xa0-\xff]/;
           # at least one ISO-8859 control character, other than
           # 0x09, 0x0a, 0x0c, and 0x0d
           next unless $magic =~ /[\x00-\x08\x0b\x0e-\x1f\x7f-\x9f]/;
           # neither the magic number nor its octet-reverse may be a
           # substring of UTF-8
           next unless not_utf8($magic);
           next unless not_utf8(scalar reverse $magic);
           last;
       }
       # Print the result as a list of hexadecimal octet values.
       print join(" ", map { sprintf("0x%02x", $_) }
                       unpack("C*", $magic)), "\n";

3.2 Magic Numbers in Textual File Formats

   This section applies to file formats where the underlying format
   consists of a string of characters, which are in turn encoded as
   plain text using some charset.

3.2.1 Recommended Placement

   The same considerations apply as for binary file formats.  The
   most common location, the most logical, and the easiest to read from,
   is the very beginning of the file.  Therefore, any new textual file
   format SHOULD place its magic number at the very beginning of the
   file, offset zero.

3.2.2 Selecting a Magic Number

3.2.2.1 Recommended Selection Criteria

   As with binary magic numbers, textual magic numbers SHOULD be
   selected randomly.  To avoid accidentally generating a magic number
   that happens to look like natural text, randomly selected textual
   magic numbers SHOULD be filtered according to the following criteria:

   o  There should be no adjacent identical characters.  Non-random data
      is relatively likely to have such patterns.

   o  There should be at least one non-alphanumeric character.

   The set of characters from which the magic number is generated
   depends on the requirements of the particular file format: different
   formats have different underlying character sets, and different
   readability and editability constraints.  Magic numbers SHOULD be
   selected from as wide a character set as is possible subject to such
   requirements.

3.2.2.2 Recommended Length

   It is RECOMMENDED that the length of a textual magic number be chosen
   to match the number of magic numbers available in binary formats.
   This length necessarily varies with the character set to which the
   magic number is limited.

   In the case of selecting a magic number from the ISO-646 graphical
   characters, which have the best possible chance of being
   representable in any character set encountered in practice, there are
   82 characters available.  This yields potential information content
   of 6.36 bits per character.  The filtering rules in section 3.2.2.1
   provide some 1.25*2^63 acceptable 10-character magic numbers
   (approximately 84% of all 10-character sequences), and 1.24*2^31
   acceptable 5-character magic numbers (72% of all 5-character
   sequences).
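
   These figures can be reproduced by counting.  The sketch below
   assumes the 82-character ISO-646 set described above, of which 62
   characters are alphanumeric; the function name is hypothetical.

```perl
use strict;
use warnings;

# Fraction of all $n-character sequences over an alphabet of $size
# characters (of which $alnum are alphanumeric) that have no adjacent
# identical characters and at least one non-alphanumeric character.
sub acceptable_fraction {
    my ($n, $size, $alnum) = @_;
    # Sequences with no adjacent identical characters.
    my $no_repeat       = $size  * ($size  - 1) ** ($n - 1);
    # Of those, the ones that are entirely alphanumeric.
    my $no_repeat_alnum = $alnum * ($alnum - 1) ** ($n - 1);
    return ($no_repeat - $no_repeat_alnum) / $size ** $n;
}

printf "bits per character: %.2f\n", log(82) / log(2);
for my $n (5, 10) {
    printf "%2d characters: %.0f%% acceptable\n",
        $n, 100 * acceptable_fraction($n, 82, 62);
}
```

   This prints 6.36 bits per character, and acceptable fractions of
   72% for 5 characters and 84% for 10 characters, in agreement with
   the figures quoted above.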

3.2.2.3 Magic Number Selection Program

   This Perl program can be used to generate high-quality textual magic
   numbers using the generation rules given in section 3.2.2.1.  It uses
   only ISO-646 graphical characters, which should be acceptable to the
   widest possible variety of applications; when designing a file format
   that requires non-ISO-646 characters anyway, it may be desirable to
   adapt this program to use a correspondingly wider selection of
   characters.

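       #!/usr/bin/perl
       use strict;
       use warnings;

       # Length of the magic number in characters (default 10).
       my $length = $ARGV[0] || 10;
       $length >= 1 or die "$0: Magic must be at least 1 character\n";
       open(my $random, '<:raw', '/dev/urandom')
           or die "$0: Can't open /dev/urandom: $!\n";
       # The 82 ISO-646 graphical characters used for magic numbers.
       my $charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".
                     "abcdefghijklmnopqrstuvwxyz".
                     "0123456789!\"%&'()*+,-./:;<=>?_";
       my $size = length($charset);
       # Octet values at or above this limit (246 for 82 characters)
       # are rejected, so that every character is equally likely.
       my $limit = 256 - 256 % $size;
       my $magic;
       while (1) {
           my $got = sysread($random, my $octets, $length);
           defined($got) or die "$0: /dev/urandom: $!\n";
           $got == $length or die "$0: Short read\n";
           my @values = unpack("C*", $octets);
           # discard samples containing an out-of-range octet
           next if grep { $_ >= $limit } @values;
           $magic = join "",
               map { substr($charset, $_ % $size, 1) } @values;
           # no adjacent identical characters
           next if $magic =~ /(.)\1/s;
           # at least one non-alphanumeric
           next unless $magic =~ /[^A-Za-z0-9]/;
           last;
       }
       print $magic, "\n";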

4 Security Considerations

4.1 Magic Numbers as a Validity Test

   As explained in section 2.1, a positive magic number test provides no
   assurance that a file is actually a valid instance of the file format
   it appears to be.  A magic number test is not a substitute for a
   complete syntactic check, and so MUST NOT be relied on as a validity
   test.

4.2 Eavesdropping Considerations

   The presence of a magic number in a file format can give an
   eavesdropper additional clues about the nature of data being
   intercepted, or may give an eavesdropper something convenient to
   search for in intercepted data if they want to find a particular type
   of data.  In any situation where eavesdropping is a concern, the use
   of strong encryption is RECOMMENDED.

4.3 Interaction with Encryption

   Where data is encrypted, the presence of a string of octets of fixed
   value, particularly at the very beginning of a data stream, can
   provide an opportunity for an attacker to apply known-plaintext
   attacks.  Good ciphers are designed to resist such attacks; such
   resistance becomes absolutely essential when dealing with data that
   is as predictable as a magic number.

   In theory, the capability for an attacker to specify a new file
   format that includes a lengthy magic number opens up the possibility
   of a very slow chosen-plaintext attack.  This is made possible by the
   lack of expectation that a magic number be in any way meaningful;
   this is the type of risk that leads cipher and hash algorithm
   designers to use mathematically significant constants instead of
   apparently random values in their algorithms.  This possibility is
   difficult to exploit, and is in most contexts less of a concern than
   direct chosen-plaintext attacks where the attacker chooses the
   content (rather than the form) of data to be encrypted.  In either
   case, the risk is mitigated by the use of good ciphers that are
   designed to resist chosen-plaintext attacks.

   Generally, all these concerns about known patterns in secret data
   already exist in any structured data; a magic number is merely the
   simplest and most extreme case.  Good ciphers, chaining modes, and
   cryptographic protocols are all intended to remain secure under
   situations of partially known or chosen plaintext.

5 Acknowledgements

   Some of the magic number selection rules in section 3.1.4.3 are due
   to Eric S. Raymond.

6 References

   [MIME-MSG]   N. Freed, N. Borenstein, "Multipurpose Internet Mail
                Extensions (MIME) Part One: Format of Internet Message
                Bodies", RFC 2045, November 1996.

   [MIME-REG]   N. Freed, J. Klensin, J. Postel, "Multipurpose Internet
                Mail Extensions (MIME) Part Four: Registration
                Procedures", BCP 13, RFC 2048, November 1996.

   [PNG]        T. Boutell, "PNG (Portable Network Graphics)
                Specification Version 1.0", RFC 2083, March 1997.

   [REQ-TERM]   S. Bradner, "Key words for use in RFCs to Indicate
                Requirement Levels", BCP 14, RFC 2119, March 1997.

7 Author's Address

   Andrew Main
   Black Ops Ltd
   36 Cannon Hill Road
   Coventry
   CV4 7DE
   United Kingdom

   Phone: +44 7887 945779
   EMail: zefram@fysh.org