Internet Draft
  Document: <draft-ietf-psamp-sample-tech-04.txt>                T. Zseby
  Expires: August 2004                                   Fraunhofer FOKUS
                                                                M. Molina
                                                          NEC Europe Ltd.
                                                               F. Raspall
                                                          NEC Europe Ltd.
                                                              N. Duffield
                                                     AT&T Labs - Research
  
                                                            February 2004
  
  
     Sampling and Filtering Techniques for IP Packet Selection
  
  
  Status of this Memo
     This document is an Internet-Draft and is in full conformance
     with all provisions of Section 10 of RFC2026.
  
     Internet-Drafts are working documents of the Internet
     Engineering Task Force (IETF), its areas, and its working
     groups. Note that other groups may also distribute working
     documents as Internet-Drafts. Internet-Drafts are draft
     documents valid for a maximum of six months and may be updated,
     replaced, or obsoleted by other documents at any time. It is
     inappropriate to use Internet-Drafts as reference material or to
     cite them other than as "work in progress."
  
     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt
  
     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.
  
  Abstract
  
     This document describes sampling and filtering techniques for IP
     packet selection. In two information models (one for sampling,
     one for filtering) it defines what parameters are needed to
     describe the most common selection schemes and shows how
     techniques can be combined to build more elaborate packet
     selectors. The information models are used for configuring the
     selection technique in measurement processes and for reporting
     the technique in use to a collector.
  
  
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004     [Page 1]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
  
  Table of Contents
  
     1.   Introduction.................................................2
     2.   Terminology..................................................3
     2.1  General Terminology..........................................3
     2.2  Packet Selection Terminology.................................6
     3.   Scope and Deployment of Packet Selection Techniques..........8
     3.1  Sampling....................................................10
     3.1.1 Systematic Sampling........................................11
     3.1.2 Random Sampling............................................12
     3.1.2.1 n-out-of-N Sampling......................................12
     3.1.2.2 Probabilistic Sampling...................................12
     3.1.2.2.1 Uniform Probabilistic Sampling.........................12
     3.1.2.2.2 Non-Uniform Probabilistic Sampling.....................12
     3.1.2.2.3 Non-Uniform Flow State Dependent Sampling..............13
     3.1.2.2.4 Configuration of non-uniform probabilistic and flow-
     state sampling...................................................13
     4.   Filtering...................................................14
     4.1  Mask/Match filtering........................................14
     4.2  Hash-based Selection........................................15
     4.2.1 Application Examples for Hash-based Selection..............16
     4.2.1.1 Approximation of Random Sampling.........................16
     4.2.1.2 Consistent Packet Selection..............................16
     4.2.2 Guarding Against Pitfalls and Vulnerabilities..............17
     4.2.3 Considerations and Recommendations for Hash-functions......18
     4.2.4 IP Shift-XOR (IPSX) hash function..........................19
     4.2.5 "Bob" hash function........................................20
     4.3  Router State filtering......................................23
     5.   Input Parameters and Information Models.....................24
     5.1 Information Model for Sampling Techniques....................25
     5.2 Information Model for Filtering Techniques...................26
     6.   Composite Techniques........................................29
     6.1 Cascaded filtering->sampling or sampling->filtering..........29
     6.2 Stratified Sampling..........................................30
     7.   Security Considerations.....................................30
     8.   References..................................................31
     9.   Author's Addresses..........................................33
     10.  Intellectual Property Statement.............................34
     11.  Full Copyright Statement....................................34
  
  1. Introduction
  
     Increasing data rates and growing measurement demands increase
     the requirements for data collection resources. High packet
     rates in backbone networks load measurement processes. Demands
     for fine granular results (e.g. per flow analysis) require
     performant and flexible classification algorithms, which are
     usually resource extensive. Furthermore, some measurement
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 2]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     methods require the capturing of packet headers or even parts of
     the payload. All this can lead to an overwhelming amount of
     measurement data, resulting in high demands regarding resources
     for measurement, storage, transport and post processing.
  
     In some cases specialized hardware helps to fulfill these
     demands but on the other hand increases the costs for providing
     the measurement. Since measurements are mainly a supporting
     functionality for the service provisioning, measurement costs
     should be limited to a small fraction of the costs of the
     network service provisioning itself. A reduction of the data
     that is considered and reported by a measurement process is
     crucial to prevent the depletion of the available (i.e. the
     affordable) resources. Such a reduction can be achieved by a
     reasonable deployment of packet selection techniques, that
     sample a subset of the packets while still allowing an
     appropriate accuracy, or filter out all packets that are not of
     interest for the measurement at all. Packet selection helps to
     prevent an exhaustion of resources and to limit the measurement
     costs. Examples for applications that benefit from packet
     selection are given in [Du04].
  
  2. Terminology
  
     The PSAMP terminology resulted from joint discussions of the
     authors of this document together with the authors of [Du04].
     Therefore all terms used throughout this document represent the
     common understanding of the authors of both documents and are
     consistent with those defined in [Du04]. Furthermore, it is
     aimed at consistency with the terms used in [QuZC03].
  
  2.1 General Terminology
  
     * Observation Point: a location in the network where a packet
        stream is observed. Examples include:
  
          (i) a line to which a probe is attached;
  
          (ii) a shared medium, such as an Ethernet-based LAN;
  
          (iii) a single port of a router, or set of interfaces
          (physical or logical) of a router;
  
          (iv) an embedded measurement subsystem within an interface.
  
     * Observed Packet Stream: the set of all packets observed at the
        observation point.
  
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 3]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     * Packet Stream: either the observed packet stream, or a subset
        of it.
  
        Note that packets selected from a stream, e.g. by sampling, do
        not necessarily possess a property by which they can be
        distinguished from packets that have not been selected. For
        this reason the term ôstreamö is favored over ôflowö, which is
        defined as set of packets with common properties [QuZC02].
  
     * Selection Process: takes a packet stream as its input and
        selects a subset of that stream as its output.
  
     * Packet Content: the union of the packet header (which includes
        link layer, network layer and other encapsulation headers) and
        the packet payload.
  
     * Selection State: a selection process may maintain state
        information for use by the selection process and/or the
        reporting process. At a given time, the selection state may
        depend on packets observed at and before that time, and other
        variables. Examples include:
  
          (i) sequence numbers of packets at the input of selectors;
  
          (ii) a timestamp of observation of the packet at the
     observation points;
  
          (iii) iterators for pseudorandom number generators;
  
          (iv) hash values calculated during selection;
  
          (v) indicators of whether the packet was selected by a
          given selector;
  
        Selection processes may change portions of the selection state
        as a result of processing a packet.
  
     * Selector: the component that performs a selection process on a
        single packet of its input. A selected packet becomes an
        element of the output packet stream of the selection process.
  
        The selector can make use of the following information in
        determining whether a packet is selected:
  
          (i) the packetÆs content;
  
          (ii) information derived from the packet's treatment at the
     observation point;
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 4]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
          (iii) any selection state that may be maintained by the
          selection process.
  
     * Composite Selection Process: an ordered composition of
        selection processes, in which the output stream issuing from
        one component forms the input stream for the succeeding
        component.
  
     * Primitive Selection Process: a selection process that is not a
        composite selection process.
  
     * Composite Selector: the selector of a composite selection
        process.
  
     * Primitive Selector: the selector of a primitive selection
        process.
  
     * Reporting Process: creates a report stream on packets selected
        by a selection process, in preparation for export. The input
        to the reporting process comprises that information available
        to the selection process per selected packet, specifically:
  
          (i) the selected packetÆs content or parts of it;
  
          (ii) information derived from the selected packet's
          treatment at the observation point;
  
          (iii) any selection state maintained by the inputting
          selection process, reflecting any modifications to the
          selection state made during selection of the packet.
  
     * Report Stream: the output of a reporting process is a report
        stream, comprising two distinguished types of information:
        packet reports, and report interpretation.
  
     * Packet Reports: a configurable subset of the per packet input
        to the reporting process.
  
     * Report Interpretation: subsidiary information relating to one
        or more packets, that is used for interpretation of their
        packet reports. Examples include configuration parameters of
        the selection process and of the reporting process.
  
     * Measurement Process: the composition of a selection process
        that takes the observed packet stream as its input, followed
        by a reporting process.
  
     * Export Process: sends the output of one or more reporting
        processes to one or more collectors.
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 5]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
  
     * Collector: a collector receives a report stream exported by
        one or more export processes. In some cases, the host of the
        measurement and/or export processes may also serve as the
        collector.
  
     * Export packets: one or more packet reports, and perhaps report
        interpretation, are bundled by the export process into a
        export packet for export to a collector.
  
     Various possibilities for the high level architecture of these
     elements are as follows.
  
         MP = Measurement Process, EP = Export Process
  
        +---------------------+                 +------------------+
        |Observation Point(s) |                 | Collector(1)     |
        |MP(s)--->EP----------+---------------->|                  |
        |MP(s)--->EP----------+-------+-------->|                  |
        +---------------------+       |         +------------------+
                                      |
        +---------------------+       |         +------------------+
        |Observation Point(s) |       +-------->| Collector(2)     |
        |MP(s)--->EP----------+---------------->|                  |
        +---------------------+                 +------------------+
  
        +---------------------+
        |Observation Point(s) |
        |MP(s)--->EP---+      |
        |              |      |
        |Collector(3)<-+      |
        +---------------------+
  
     The PSAMP measurement process can be viewed as analogous to the
     IPFIX metering process. The PSAMP measurement process takes an
     observed packet stream as its input, and produces packet reports
     as its output. The IPFIX metering process produces flow records
     as its output. The distinct name ômeasurement processö has been
     retained in order to avoid potential confusion in settings where
     IPFIX and PSAMP coexist, and in order to avoid the implicit
     requirement that the PSAMP version satisfy the requirements of
     an IPFIX metering process (at least while these are under
     development). The relation between PSAMP and IPFIX is further
     discussed in [QC03].
  
  2.2 Packet Selection Terminology.
  
     * Filtering: a filter is a selection operation that selects a
        packet deterministically based on the packet content, its
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 6]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
        treatment, and functions of these occurring in the selection
        state. Examples include mask/match filtering, and hash-based
        selection.
  
     * Sampling: a selection operation that is not a filter is called
        a sampling operation. This reflects the intuitive notion that
        if the selection of a packet cannot be determined from its
        content alone, there must be some type of sampling taking
        place.
  
     * Content-independent Sampling: a sampling operation that does
        not use packet content (or quantities derived from it) as the
        basis for selection is called a content-independent sampling
        operation. Examples include systematic sampling, and uniform
        pseudorandom sampling driven by a pseudorandom number whose
        generation is independent of packet content. Note that in
        content-independent sampling it is not necessary to access the
        packet content in order to make the selection decision.
  
     * Content-dependent Sampling: a sampling operation where
        selection is dependent on packet content is called a content-
        dependent sampling operation. Examples include pseudorandom
        selection according to a probability that depends on the
        contents of a packet field. Note that this is not a filter ,
        because the selection is not deterministic..
  
     * Hash Domain: a subset of the packet content and the packet
        treatment, viewed as an N-bit string for some positive integer
        N.
  
     * Hash Range: a set of M-bit strings for some positive integer
        M.
  
     * Hash Function: a deterministic map from the hash domain into
        the hash range.
  
     * Hash Selection Range: a subset of the hash range. The packet
        is selected if the action of the hash function on the hash
        domain for the packet yields a result in the hash selection
        range.
  
     * Hash-based Selection: filtering specified by a hash domain, a
        hash function, and hash range and a hash selection range.
  
     * Approximative Selection: selection operations in any of the
        above categories may be approximated by operations in the same
        or another category for the purposes of implementation. For
        example, uniform pseudorandom sampling may be approximated by
        hash-based selection, using a suitable hash function and hash
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 7]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
        domain. In this case, the closeness of the approximation
        depends on the choice of hash function and hash domain.
  
     * Population size: the number of all packets in the packet
        stream (or subset) for which a metric should be estimated
  
     * Sample size: the number of packets selected from the
        population by a selection operation.
  
     * Attained Selection Frequency: the actual frequency with which
        packets are selected by a selection process. The attained
        sampling frequency is calculated as ratio of the size of a
        sample size to the size of the population from which it was
        selected.
  
     * Target Selection Frequency: the long-term frequency with which
        packets are expected to be selected, based on selector
        parameter settings. Depending on the selector, the target
        selection frequency may be count-based or time-based.
  
        Due to the inherent statistical variability of sampling
        decisions, the target and attained selection frequencies can
        differ (e.g. for probabilistic sampling and hash-based
        selection). Nevertheless, for large population sizes and
        properly configured sampling schemes the attained selection
        frequency usually approaches the target selection frequency.
        In hash-based selection, the target selection frequency is the
        quotient of size of the hash selection range by the size of
        the hash range.
  
  3. Scope and Deployment of Packet Selection Techniques
  
     Packet selection techniques generate a subset of packets from an
     Observed Packet Stream at an observation point. We distinguish
     between sampling and filtering.
  
     Sampling is targeted at the selection of a representative subset
     of packets. The subset is used to infer knowledge about the
     whole set of observed packets without processing them all. The
     selection can depend on packet position, and/or on packet
     content, and/or on (pseudo) random decisions.
  
     Filtering selects a subset with common properties. This is used
     if only a subset of packets is of interest. The properties can
     be directly derived from the packet content, or depend on the
     treatment given by the router to the packet. Filtering is a
     deterministic operation. It depends on packet content or router
     treatment. It never depends on packet position or on (pseudo)
     random decisions.
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 8]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     Note that a common technique to select packets is to compute a
     Hash Function on some bits of the packet header and/or content
     and to select it if the Hash Value falls in  the Hash Selection
     Range. Since hashing is a deterministic operation on the packet
     content, it is a filtering technique according to our
     categorization. Nevertheless, hash functions are sometimes used
     to approximate random sampling. Depending on the chosen input
     bits, the Hash Function and the Hash Selection Range, this
     technique can be used to approximate the random selection of
     packets with a given probability p. It is also a powerful
     technique to consistently select the same packet subset at
     multiple observation points [DuGr00]
  
     The following table gives an overview of the schemes described
     in this document and their categorization. An X in brackets (X)
     denotes schemes for which also content-independent variants
     exist. It easily can be seen that only schemes with both
     properties, content dependence and deterministic selection, are
     considered as filters.
  
            Selection Scheme   | deterministic | content- | Category
                               |  selection    | dependent|
       ------------------------+---------------+----------+----------
        systematic             |       X       |     _    | Sampling
        count-based            |               |          |
       ------------------------+---------------+----------+----------
        systematic             |       X       |     -    | Sampling
        time-based             |               |          |
       ------------------------+---------------+----------+----------
        random                 |       -       |     -    | Sampling
        n-out-of-N             |               |          |
       ------------------------+---------------+----------+----------
        random                 |       -       |     -    | Sampling
        uniform probabilistic  |               |          |
       ------------------------+---------------+----------+----------
        random                 |       -       |    (X)   | Sampling
        non-uniform probabil.  |               |          |
       ------------------------+---------------+----------+----------
        random                 |       -       |    (X)   | Sampling
        non-uniform flow-state |               |          |
       ------------------------+---------------+----------+----------
        mask/match filter      |       X       |     X    | Filter
       ------------------------+---------------+----------+----------
        hash function          |       X       |     X    | Filter
       ------------------------+---------------+----------+----------
        router state filter    |       X       |    (X)   | Filter
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 9]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
  
       ------------------------+---------------+----------+----------
  
     The introduced categorization is mainly useful for the
     definition of an information model describing Primitive
     Selectors . More complex selection techniques can be described
     through the composition of cascaded sampling and filtering
     operations.
     For example, a packet selection that weights the selection
     probability on the basis of the packet length can be described
     as a cascade of a filter and a sampling scheme. However, this
     descriptive approach is not intended to be rigid: if a common
     and consolidated selection practice turns out to be too complex
     to be described as a composition of the mentioned building
     blocks, an ad hoc description can be specified instead and added
     as a new scheme to the information model.
  
     Packet selectors can be part of an IPFIX metering process and
     can also use the IPFIX exporting process. This is expressed as
     association to one or more IPFIX processes.
  
  3.1 Sampling
  
     The deployment of sampling techniques aims at the provisioning
     of information about a specific characteristic of the parent
     population at a lower cost than a full census would demand. In
     order to plan a suitable sampling strategy it is therefore
     crucial to determine the needed type of information and the
     desired degree of accuracy in advance.
  
     First of all it is important to know the type of metric that
     should be estimated. The metric of interest can range from
     simple packet counts [JePP92] up to the estimation of whole
     distributions of flow characteristics (e.g. packet
     sizes)[ClPB93].
  
     Secondly, the required accuracy of the information and with
     this, the confidence that is aimed at, should be known in
     advance. For instance for usage-based accounting the required
     confidence for the estimation of packet counters can depend on
     the monetary value that corresponds to the transfer of one
     packet. That means that a higher confidence could be required
     for expensive packet flows (e.g. premium IP service) than for
     cheaper flows (e.g. best effort). The accuracy requirements for
     validating a previously agreed quality can also vary extremely
     with the customer demands. These requirements are usually
     determined by the service level agreement (SLA).
  
  
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 10]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     The sampling method and the parameters in use must be clearly
     communicated to all applications that use the measurement data.
     Only with this knowledge a correct interpretation of the
     measurement results can be ensured.
  
     Sampling methods can be characterized by the sampling algorithm,
     the trigger type used for starting a sampling interval and the
     length of the sampling interval. These parameters are described
     here in detail. The sampling algorithm describes the basic
     process for selection of samples. In accordance to [AmCa89] and
     [ClPB93] we define the following basic sampling processes:
  
      3.1.1       Systematic Sampling
  
     Systematic sampling describes the process of selecting the start
     points and the duration of the selection intervals according to
     a deterministic function. This can be for instance the periodic
     selection of every k-th element of a trace but also the
     selection of all packets that arrive at pre-defined points in
     time. Even if the selection process does not follow a periodic
     function (e.g. if the time between the sampling intervals varies
     over time) we consider this as systematic sampling as long as
     the selection is deterministic.
     The use of systematic sampling always involves the risk of
     biasing the results. If the systematics in the sampling process
     resemble systematics in the observed stochastic process
     (occurrence of the characteristic of interest in the network),
     there is a high probability that the estimation will be biased.
     Systematics in the observed process might not be known in
     advance.
     Here only equally spaced schemes are considered, where triggers
     for sampling are periodic, either in time or in packet count.
     All packets occurring in a selection interval (either in time or
     packet count) beyond the trigger are selected.
  
     Systematic count-based
     In systematic count-based sampling the start and stop triggers
     for the sampling interval are defined in accordance to the
     spatial packet position (packet count).
  
     Systematic time-based
     In systematic count-based sampling the start and stop triggers
     for the sampling interval are defined in accordance to the
     temporal packet position (arrival time).
  
     Both schemes are contentûindependent selection schemes. Content
     dependent deterministic selectors are categorized as filter.
  
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 11]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
  3.1.2   Random Sampling
  
     Random sampling selects the starting points of the sampling
     intervals in accordance to a random process. The selection of
     elements are independent experiments. With this, unbiased
     estimations can be achieved. In contrast to systematic sampling,
     random sampling requires the generation of random numbers. One
     can differentiate two methods of random sampling:
  
  3.1.2.1 n-out-of-N Sampling
  
     In n-out-of-N sampling n elements are selected out of the parent
     population that consists of N elements. One example would be to
     generate n different random numbers in the range [1,N] and
     select all packets which have a packet position equal to one of
     the random numbers. For this kind of sampling the sample size n
     is fixed.
  
  3.1.2.2 Probabilistic Sampling
  
     In probabilistic sampling the decision whether an element is
     selected or not is made in accordance to a pre-defined selection
     probability. An example would be to flip a coin for each packet
     and select all packets for which the coin showed the head. For
     this kind of sampling the sample size can vary for different
     trials. The selection probability does not necessarily has to be
     the same for each packet. Therefore we distinguish between
     uniform probabilistic sampling (with the same selection
     probability for all packets) and non-uniform  probabilistic
     sampling (where the selection probability can vary for different
     packets).
  
  
  3.1.2.2.1      Uniform Probabilistic Sampling
  
     For Uniform Probabilistic  Sampling packets are selected
     independently with a uniform probability p. This sampling can be
     count-driven, and is sometimes referred to as geometric random
     sampling, since the difference in count between successive
     selected packets are independent random variables with a
     geometric distribution of mean 1/p. A time-driven analog,
     exponential random  sampling, has the time between triggers
     exponentially distributed.
     Both geometric and exponential random sampling are examples of
     what is known as additive random sampling, defined as sampling
     where the intervals or counts between successive samples are
     independent identically distributed random variable.
  
  3.1.2.2.2      Non-Uniform Probabilistic Sampling
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 12]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
  
     This is a variant of Probabilistic Sampling in which the
     sampling probabilities can depend on the selection process
     input. This can be used to weight sampling probabilities in
     order e.g. to boost the chance of sampling packets that are rare
     but are deemed important. Unbiased estimators for quantitative
     statistics are recovered by renormalization of sample values;
     see [HT52].
  
  
  3.1.2.2.3      Non-Uniform Flow State Dependent Sampling
  
     Another type of sampling that can be classified as probabilistic
     Non-Uniform  is closely related to the flow concept as defined
     in [QuZC02], and it is only used jointly with a flow monitoring
     function (IPFIX monitoring function). Packets are selected,
     dependent on a selection state. The point, here, is that the
     selection state is determined also by the state of the flow the
     packet belongs to and/or by the state of the other flows
     currently being monitored by the associated flow monitoring
     function. An example for such an algorithm is the ösample and
     holdö method described in [EsVa01]:
  
     - If a packet accounts for a flow record that already exists in
        the IPFIX flow recording process, it is selected (i.e. the
        flow record is updated)
     - If a packet doesn't account to any existing flow record, it is
        selected with probability p. If it has been selected a new
        flow record has to be created.
  
     A further algorithm that fits into the category of non-uniform
     flow state dependent sampling is described in [Moli03].
  
     This type of sampling is content dependent because the
     identification of the flow the packet belongs to requires
     analyzing part of the packet content. If the packet is selected,
     then it is passed as an input to the IPFIX monitoring function
     (this is called öLocal Exportö in [Du04]
     Selecting the packet depending on the state of a flow cache is
     useful when memory resources of the flow monitoring function are
     scarce (i.e. thereÆs no room to keep all the flows that have
     been scheduled for monitoring). See [MolFl03] for a more
     detailed description of the motivations for this type of
     sampling and the impact on the IPFIX metering.
  
  
  3.1.2.2.4      Configuration of non-uniform probabilistic and
         flow-state sampling
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 13]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     Many different specific methods can be grouped under the terms
     non-uniform probabilistic and flow state sampling. Dependent on
     the sampling goal and the implemented scheme, a different number
     and type of input parameters is required to configure such
     scheme.
  
     Some concrete proposals for such methods exist from the research
     community (e.g. [EsVa01],[DuLT01],[Moli03]). All of these
     proposals are still in an early stage and need further
     investigations to prove their usefulness and applicability.
  
     Since we donÆt want give preference to one of the existing early
     stage methods we here only describe the basic methods and leave
     the specification of explicit schemes and their parameters up to
     vendors (e.g. as extension of the information model).
  
  
  4. Filtering
  
     Filtering is the deterministic selection of packets based on the
     packet content, the treatment of the packet at the observation
     point, or deterministic functions of these occurring in the
     selection state. The packet is selected if these quantities fall
     into a specified range.
     The role of filtering, as the word itself suggest, is to
     separate all the packets having a certain property from those
     not having it. A distinguishing characteristic from sampling is
     that the property never depends on the packet position in time
     or in the space, or on a random process.
     We identify and describe in the following three filtering
     techniques. The first two (Mask/Match filtering and Hashing
     filtering) are stateless, and therefore can make their decision
     based on the analysis of portion of the packet only. The other
     (router state filtering) requires to access state information
     after the analysis of part of the packet and is therefore more
     complex: its usage makes sense only in particular circumstances,
     as described below.
  
  4.1 Mask/Match filtering
  
     This type of filtering selects a packet operating as follows:
  
     A number of bit positions are chosen in accordance to the
     filtering goal. A mask is applied with a logical AND to the
     incoming packet to keep only the chosen bits. The result of this
     operation is then compared to a predefined single value (e.g. a
     specific source IP address), a set of values or a range. The
     packet is selected in accordance to the result of this
     comparison.
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 14]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     The selected bits of the packet arenÆt necessarily only those of
     the IP header. If  bits of the IP header and bits of the payload
     are considered, the masks and the selection intervals MUST be
     specified separately for the header and the payload.
     An implementation is not constrained to apply exactly all the
     steps or in this sequence, provided that the result is
     equivalent to a logical function doing it.
     Examples of filters of this class are filters that select
     packets based on the matching of some of the IP header fields
     with a (possibly masked) specific value, filters that select
     packets having some IP header fields values falling within a
     range, filters that do the same as above on some of the
     transport header fields (that are thus considered as part of the
     payload), or a combination of all of the above mentioned
     possibilities.
  
  
  4.2 Hash-based Selection
  
     A Hash Function h maps the packet content c, or some portion of
     it, onto a Hash Range R. The packet is selected if h(c) is an
     element of S, which is a subset of R called the Hash Selection
     Range. Thus hash-based sampling is indeed a particular case of
     filtering: the object is selected if c is in inv(h(S)). But for
     desirable Hash Functions the inverse image inv(h(S)) will be
     extremely complex, and hence h would not be expressible as, say,
     a match/mask filter or a simple combination of these.
     Hash-based selection has mainly two types of usage: it offers a
     way to approximate random sampling by using packet content to
     generate pseudorandom variates and a way to consistently select
     subsets of packets that share a common property (e.g. at
     different observation points).
     In the following subsections we give more details about them.
     However, both usages require that the Hash Functions has two
     statistical properties. First, the hash function h must have
     good mixing properties, in the sense that small changes in the
     input (e.g. the flipping of a single bit) cause large changes in
     the output (many bits change). Then any local clump of values of
     c is spread widely over R by h, and so the distribution of h(c)
     is fairly uniform even if the distribution of c is not. Then the
     sampling rate is #S/#R, which can be tuned by choice of S. If S
     and R are sets contiguous integers, h(c), suitably shifted and
     normalized, can be interpreted as a pseudorandom variate. The
     second desirable property depends more closely on the statistics
     of the content c. In applications, the content c comprises a
     number of distinct fields, c1 ... cm, e.g. source and
     destination IP Address, IP identification, and TCP/UDP port
     numbers (if present) for a packet. With a hash function
     satisfying the first properties above, selection decisions will
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 15]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     appear uncorrelated with the contents of any individual field,
     if the complementary fields are (i) sufficiently variable
     themselves, and (ii) sufficiently uncorrelated with cj.
  
      4.2.1       Application Examples for Hash-based Selection
  
  4.2.1.1 Approximation of Random Sampling
  
     Although pseudorandom number generators with well understood
     properties have been developed, they may not be the method of
     choice in setting where computational resources are scarce. A
     convenient alternative is to use hash functions of packet
     content as a source of randomness. The hash (suitably
     renormalized) is a pseudorandom variate in the interval [0,1].
     Other schemes may use packet fields in iterators for
     pseudorandom numbers.
     The point here, is that the statistical properties of the
     idealized packet selection law (such as independence of sampling
     decisions for different packets, or independence on packet
     content) may not be exactly shared by an implementation, but
     only approximately so.
     Although the selection decisions for non-uniform probabilistic
     sampling (see Section 3.1.2.2.2 above) also depend on the packet
     content, this form of sampling is distinguished from the use of
     packet content to generate variates. In the former case, the
     content only determines the selection probabilities: selection
     could then proceed e.g by use of a variates obtained by an
     independent pseudorandom number generator. In the latter case,
     the content determines the pseudorandom variates rather than the
     probabilities.
  
  
  4.2.1.2 Consistent Packet Selection
  
     In Consistent packet selection, all routers in a network hash
     parts of the packet content using identical hash function and
     selection range. The domain of the hash is restricted to those
     parts of the packet that are invariant from hop to hop. Fields
     such as Time-to-Live, which is decremented per hop, and header
     CRC, which is recalculated per hop, are thus excluded from the
     hash domain. Thus, a given packet is selected at either all
     points on its path through the network, or at none. The domain
     of the hash function needs to be wider than just a flow key, if
     packets are to be selected quasi-randomly within flows (and e.g.
     include portions of the payload), see [DuGr00].
  
     If a report on each selected packet is exported to a collector,
     the collector can reconstruct trajectories of the selected
     packets, provided it can match different reports on the same
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 16]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     packet, and distinguish these from reports on other packets. For
     this purpose a second distinct hash function may be used to
     generate a label or digest of each selected packet for inclusion
     in its packet report. The benefit of using a digest is reduction
     in reporting bandwidth. Routers need only report the digest
     provided at least one router (say an edge router) sends a full
     packet report containing packet fields of interest in addition
     to the digest. A digest size of 32 bits has been found
     sufficient to distinguish packets; see [DuGeGr02].
  
  
     Applications of consistent packet selection  include (i)
     estimation of the network path matrix, i.e., the traffic
     intensities according to network path, broken down by flow; (ii)
     detection of routing loops, as indicated by self-intersecting
     trajectories; (iii) passive performance measurement: prematurely
     terminating trajectories indicate packet loss, packet one way
     delay can be determined if reports include (synchronized)
     timestamps; (iv) network attack tracing, through determination
     of the actual paths taken by attack packets with spoofed source
     addresses.
  
      4.2.2       Guarding Against Pitfalls and Vulnerabilities
  
     A concern is whether some large set of related packets could be
     sampled at a rate that significantly differs from the expected
     sampling rate, either (i) through unanticipated behavior in the
     hash function, or (ii) because the packets had been deliberately
     crafted to have this property.
     The first point underlines the importance of using a hash
     function with good mixing properties. Examples of such are CRC32
     and hash functions based on modular arithmetic, see 6.4 in
     [Knuth98]. The statistical properties of candidate hash
     functions need to be evaluated, preferably on packet before
     adoption for hash-based sampling
     Hash sampling could be overloaded (or evaded) by an attacker if
     the hash function and the selection rate are both known. A
     service provider could build a first defense keeping the Hash
     Selection Range S private. Then, an attacker could not determine
     whether a crafted packet is selected, but he would still know
     that a crafted set of packets all with the same hash is either
     all selected or all not selected. Moreover, when applications
     (like multi domain trajectory sampling, or one way delay
     estimation across multiple domains) may require multiple
     administrative entities to agree on a common hash function and
     selection range, mutual trust between the entities cannot be
     avoided and just keeping S secret may not be feasible. A
     stronger defense is to employ a parametrizable hash function and
     keep the parameter private: in this case, the set of hash values
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 17]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     of the packets could not be determined. Examples of parameters
     are the initial vector in CRC32, and moduli in hashes based on
     modular arithmetic.
  
      4.2.3       Considerations and Recommendations for Hash
           Functions.
  
     For hash-based selection, the most important requirements for a
     hash function are simplicity of implementation and speed of
     operation, followed by uniformity of hash distribution.
     Simplicity promotes wider implementation and the ability to
     operate at line speed. Uniformity of hash promotes the
     approximation of random sampling by hash-based selection.
  
     For a packet digesting hash, uniformity of hash distribution and
     a small collision probability are the most important
     requirements. Since the digest need only be computed over the
     substream of selected packets, a digesting hash does not have
     the same speed requirements as a sampling hash. Thus digesting
     enables and benefits from the use of more complex hash functions
     than hash-based selection.
  
     The properties of a number of different hash functions were
     evaluated and compared. On this basis, the following
     recommendations are made for PSAMP hash functions.
  
     When hash-based packet selection is supported, the following
     hash functions MUST be offered:
  
     * The IPSX hash function; see Section 4.2.4 below. IPSX is a
        fast hash function with good uniformity properties, and is
        intended for hash-based selection. It operates with fixed
        length input tailored for IPv4 packets, and produces a 16 bit
        output, enabling sampling down to rates of 1 in 65536.
  
  
     * The CRC32 hash function; see [ISO3309]. CRC32 has small
        collision probabilities, its 32 bit output being sufficiently
        large to function as a digest. Its use as a selection hash is
        not precluded, however it is roughly an order of magnitude
        slower than IPSX and so is not expected to function as widely
        for this purpose. It can accept a variable length input and so
        provides a potential future path for hash-based selection of
        packets of protocols other than IPv4.
  
     Other hash functions MAY be provided. A candidate hash function
     is the BOB hash function described in Section 4.2.5 below. Its
     performance in speed, uniformity and collisions are comparable
     with (and slightly superior to) CRC32. As remarked in Section 9
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 18]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     of [Du04], the MD5 hash function may be precluded for ubiquitous
     use at full line rate, at least for hash-based selection.
  
      4.2.4        IP Shift-XOR (IPSX) hash function
  
     The IPSX hash function is tailored for acting on IP version 4
     packets. It exploits the structure of IP packet and in
     particular the variability expected to be exibited within
     different fields of the IP packet in order to furnish a hash
     value with little apparent correlation with individual packet
     fields. Fields from the IPv4 and TCP/UDP headers are used as
     input. The IPSX hash function uses a small number of simple
     instructions.
  
     Input parameters: None
  
     Built-in parameters: None
  
     Output: The output of the IPSX is a 16 bit number
  
     Functioning:
     The functioning can be divided into two parts: input selection,
     which forms are composite input from various portions of the IP
     packet, followed by computation of the hash on the composite.
  
     Input Selection:
     The raw input is drawn from the first 20 bytes of the IP packet
     header and the first 8 bytes of the IP payload. If IP options
     are not used, the IP header has 20 bytes, and hence the two
     portions adjoin and comprise the first 28 bytes of the IP
     packet. We now use the raw input as 4 32-bit subportions of
     these 28 bytes. We specify the input by bit offsets from the
     start of IP header or payload.
  
     f1 = bits 32 to 63 of the IP header, comprising the IP
          identification field, flags, and fragment offset.
  
     f2 = bits 96 to 127 of the IP header, the source IP address.
  
     f3 = bits 128 to 159 of the IP header, the destination IP
          address.
  
     f4 = bits 32 to 63 of the IP payload. For a TCP packet, f4
          comprises the TCP sequence number followed by the message
          length. For a UDP packet f4 comprises the UDP checksum.
  
     Hash Computation:
  
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 19]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     The hash is computed from f1, f2, f3 and f4 by a combination of
     XOR (^), right shift (>>) and left shift (<<) operations. The
     intermediate quantities h1, v1, v2 are 32-bit numbers.
  
            1.    v1 = f1 ^ f2;
            2.    v2 = f3 ^ f4;
            3.    h1 = v1 << 8;
            4.    h1 ^= v1 >> 4;
            5.    h1 ^= v1 >> 12;
            6.    h1 ^= v1 >> 16;
            7.    h1 ^= v2 << 6;
            8.    h1 ^= v2 << 10;
            9.    h1 ^= v2 << 14;
            10.   h1 ^= v2 >> 7;
  
     The output of the hash is the least significant 16 bits of h1.
  
  
      4.2.5       "Bob" hash function
  
  
  
  
  
  
     "Bob" hash function is a hash function designed for having each
     bit of the input affecting every bit of the return value and
     using both 1-bit and 2-bit deltas to achieve the so called
     avalanche effect [Jenk97]. This function was originally built
     for hash table lookup with fast software implementation.
  
     Input Parameters:
     The input parameters of such a function are:
     - the length of the input string (key) to be hashed, in
     bytes. The elementary input blocks of Bob hash are the single
     bytes, therefore no padding is needed.
     - an init value (an arbitrary 32-bit number).
  
     Built in parameters:
     The Bob Hash uses the following built-in parameter:
     - the golden ratio (an arbitrary 32-bit number used in the hash
     function computation: its purpose is to avoid mapping all zeros
     to all zeros);
  
     Note: the mix sub-function (see mix (a,b,c) macro in the
     reference code in 3.2.4) has a number of parameters governing
     the shifts in the registers. The one presented is not the only
     possible choice.
  
     It is an open point whether these may be considered additional
     built-in parameters to specify at function configuration.
  
     Output.
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 20]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     The output of the BOB function is a 32-bit number. It should be
     specified:
     - A 32 bit mask to apply to the output
     - The selection range as a list of non overlapping intervals
     [start value, end value] where value is in [0,2^32]
  
     Functioning:
     The hash value is obtained computing first an initialization of
     an internal state (composed of 3 32-bit numbers, called a, b, c
     in the reference code below), then, for each input byte of the
     key the internal state is combined by addition and mixed using
     the mix sub-function. Finally, the internal state mixed one last
     time and the third number of the state (c) is chosen as the
     return value.
  
     typedef  unsigned long  int  ub4;   /* unsigned 4-byte
     quantities */
     typedef  unsigned       char ub1;   /* unsigned 1-byte
     quantities */
  
     #define hashsize(n) ((ub4)1<<(n))
     #define hashmask(n) (hashsize(n)-1)
  
     /* ------------------------------------------------------
       mix -- mix 3 32-bit values reversibly.
       For every delta with one or two bits set, and the deltas of
     all three high bits or all three low bits, whether the original
     value of a,b,c is almost all zero or is uniformly distributed,
       * If mix() is run forward or backward, at least 32 bits in
     a,b,c have at least 1/4 probability of changing.
       * If mix() is run forward, every bit of c will change between
     1/3 and 2/3 of the time.  (Well, 22/100 and 78/100 for some 2-
     bit deltas.) mix() was built out of 36 single-cycle latency
     instructions in a structure that could supported 2x parallelism,
     like so:
             a -= b;
             a -= c; x = (c>>13);
             b -= c; a ^= x;
             b -= a; x = (a<<8);
             c -= a; b ^= x;
             c -= b; x = (b>>13);
             ...
     Unfortunately, superscalar Pentiums and Sparcs can't take
     advantage of that parallelism.  They've also turned some of
     those single-cycle latency instructions into multi-cycle latency
     instructions
  
     ------------------------------------------------------------*/
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 21]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
       #define mix(a,b,c)  \
       { \
         a -= b; a -= c; a ^= (c>>13); \
         b -= c; b -= a; b ^= (a<<8); \
         c -= a; c -= b; c ^= (b>>13); \
         a -= b; a -= c; a ^= (c>>12);  \
         b -= c; b -= a; b ^= (a<<16); \
         c -= a; c -= b; c ^= (b>>5); \
         a -= b; a -= c; a ^= (c>>3);  \
         b -= c; b -= a; b ^= (a<<10); \
         c -= a; c -= b; c ^= (b>>15); \
       }
  
       /* -----------------------------------------------------------
     hash() -- hash a variable-length key into a 32-bit value
     k       : the key (the unaligned variable-length array of bytes)
     len     : the length of the key, counting by bytes
     initval : can be any 4-byte value
     Returns a 32-bit value.  Every bit of the key affects every bit
     of the return value.  Every 1-bit and 2-bit delta achieves
     avalanche. About 6*len+35 instructions.
  
     The best hash table sizes are powers of 2.  There is no need to
     do mod a prime (mod is sooo slow!).  If you need less than 32
     bits, use a bitmask.  For example, if you need only 10 bits, do
     h = (h & hashmask(10));
     In which case, the hash table should have hashsize(10) elements.
  
     If you are hashing n strings (ub1 **)k, do it like this:
     for (i=0, h=0; i<n; ++i) h = hash( k[i], len[i], h);
  
     By Bob Jenkins, 1996.  bob_jenkins@burtleburtle.net.  You may
     use this code any way you wish, private, educational, or
     commercial.  It's free.
  
    See http://burtleburtle.net/bob/hash/evahash.html
     Use for hash table lookup, or anything where one collision in
     2^^32 is acceptable.  Do NOT use for cryptographic purposes.
      ----------------------------------------------------------- */
  
       ub4 bob_hash(k, length, initval)
       register ub1 *k;        /* the key */
       register ub4  length;   /* the length of the key */
       register ub4  initval;  /* an arbitrary value */
       {
          register ub4 a,b,c,len;
  
          /* Set up the internal state */
          len = length;
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 22]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
          a = b = 0x9e3779b9; /*the golden ratio; an arbitrary value
     */
          c = initval;         /* another arbitrary value */
  
     /*------------------------------------ handle most of the key */
  
          while (len >= 12)
          {
             a += (k[0] +((ub4)k[1]<<8) +((ub4)k[2]<<16)
     +((ub4)k[3]<<24));
             b += (k[4] +((ub4)k[5]<<8) +((ub4)k[6]<<16)
     +((ub4)k[7]<<24));
             c += (k[8] +((ub4)k[9]<<8)
     +((ub4)k[10]<<16)+((ub4)k[11]<<24));
             mix(a,b,c);
             k += 12; len -= 12;
          }
  
          /*---------------------------- handle the last 11 bytes */
          c += length;
          switch(len)       /* all the case statements fall through*/
          {
          case 11: c+=((ub4)k[10]<<24);
          case 10: c+=((ub4)k[9]<<16);
          case 9 : c+=((ub4)k[8]<<8);
             /* the first byte of c is reserved for the length */
          case 8 : b+=((ub4)k[7]<<24);
          case 7 : b+=((ub4)k[6]<<16);
          case 6 : b+=((ub4)k[5]<<8);
          case 5 : b+=k[4];
          case 4 : a+=((ub4)k[3]<<24);
          case 3 : a+=((ub4)k[2]<<16);
          case 2 : a+=((ub4)k[1]<<8);
          case 1 : a+=k[0];
            /* case 0: nothing left to add */
          }
          mix(a,b,c);
          /*-------------------------------- report the result */
          return c;
       }
  
  
  
  
  4.3 Router State filtering
  
     This class of filters select a packet on the basis of router
     state conditions. The following list gives examples for such
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 23]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     conditions. Conditions can be  combined with AND, OR or NOT
     operators.
  
        - Ingress interface at which the packet arrives equals a
           specified value
        - Egress interface to which the packet is routed equals a
           specified value
        - Packet violated Access Control List (ACL) on the router
        - Reverse Path Forwarding (RPF) failed for the packet
        - Resource Reservation is insufficient for the packet
        - No route found for the packet
        - Origin AS equals a specified value or lies within a given
           range
        - Destination AS equals a specified value or lies within a
           given range
  
     Router architectural considerations may preclude some
     information concerning the packet treatment, e.g. routing state,
     being available at line rate for selection of packets. However,
     if selection not based on routing state has reduced down from
     line rate, subselection based on routing state may be feasible.
  
  5. Input Parameters and Information Models
  
     This section gives an overview of different alternative
     selection schemes and their required parameters. In order to be
     compliant with PSAMP it is sufficient to implement one of the
     proposed schemes.
  
     The decision whether to select a packet or not is based on a
     function which is performed when the packet arrives at the
     selection process. Packet selection schemes differ in the input
     parameters for the selection process and the functions they
     require to do the packet selection. The following table gives an
     overview.
  
          Scheme       |   input parameters     |     functions
        ---------------+------------------------+-------------------
         systematic    |    packet position     |  packet counter
         count-based   |    sampling pattern    |
        ---------------+------------------------+-------------------
         systematic    |      arrival time      |  clock or timer
         time-based    |     sampling pattern   |
        ---------------+------------------------+-------------------
           random      |  packet position       |  packet counter,
         n-out-of-N    |  sampling pattern      |  random numbers
                       | (random number list)   |
        ---------------+------------------------+-------------------
         uniform       |        sampling        |  random function
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 24]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
         probabilistic |      probability       |
        ---------------+------------------------+-------------------
         non-uniform   |e.g. packet position,   | selection function,
         probabilistic |  packet content(parts) |  probability calc.
        ---------------+------------------------+-------------------
         non-uniform   |e.g. flow state,        | selection function,
         flow-state    |  packet content(parts) |  probability calc.
        ---------------+------------------------+-------------------
         mask/match    | packet content(parts)  |  filter function
        ---------------+------------------------+-------------------
         hash-based    |  packet content(parts) |  hash function
        ---------------+------------------------+-------------------
         router state  |    router state        |   router state
                       |                        |   discovery
        ---------------+------------------------+-------------------
  
  
  5.1 Information Model for Sampling Techniques
  
     In this section we define the information models for most common
     sampling techniques. Here the selection function is pre-defined
     and given by the selector ID.
  
     Sampler Description:
          SELECTOR_ID
          SELECTOR_TYPE
          SELECTOR_PARAMETERS
          ASSOCIATIONS
  
     Where:
  
     SELECTOR_ID:
     Unique ID for the packet sampler. The ID can be calculated under
     consideration of the ASSOCIATIONS and a local ID.
  
  
     SELECTOR_TYPE
     For sampling processes the SELECTOR TYPE defines what sampling
     algorithm is used.
     Values: Systematic Count-based | Systematic Time-based | Random
     n-out-of-N | Uniform Probabilistic | Non-uniform Probabilistic |
     Non-uniform Flow-state
  
  
     SELECTOR_PARAMETERS
     For sampling processes the SELECTOR PARAMETERS define the input
     parameters for the process. Interval length in systematic
     sampling means, that all packets that arrive in this interval
     are selected. The spacing parameter defines the spacing in time
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 25]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     or number of packets between the end of one sampling interval
     and the start of the next succeeding interval.
  
     Case n out of N:
        - Population size N, Sample size n
  
     Case Systematic Time Based:
        - Interval length (in usec), Spacing (in usec)
  
     Case Systematic Count Based:
        - Interval length(in packets), Spacing (in packets)
  
     Case Uniform Probabilistic (with equal probability per packet):
        - Sampling probability p
  
     Case Non-uniform Probabilistic:
        - Calculation function for sampling probability p (see also
           section 3.1.2.2.4)
  
     Case flow state:
        - Information reported for flow state can be found in
           [MolFl03](see also section 3.1.2.2.4)
  
     ASSOCIATIONS
     The ASSOCIATIONS field describes the observation point and
     (possibly) the IPFIX processes to which the packet selector is
     associated. The STREAM ID denotes the origin of the data stream
     that is input to the selection function. It can be the
     observation point directly or the ID of another selector. With
     this it is possible to define combined schemes. If the STREAM ID
     contains IDs from other selectors, one can derive the original
     observation point from the selector definitions of these
     specified selectors.
  
     Values: <STREAM ID, IPFIX Metering process ID, IPFIX Exporting
     process ID, IDs of other associated processes>
     With STREAM ID: Observation point ID AND List of SELECTOR_IDs
  
  
  5.2 Information Model for Filtering Techniques
  
     In this section we define the information models for most common
     filtering techniques. The information model structure closely
     parallels the one presented for the sampling techniques.
  
     Filter Description:
          SELECTOR_ID
          SELECTOR_TYPE
          SELECTOR_PARAMETERS
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 26]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
          ASSOCIATIONS
  
     Where:
  
     SELECTOR_ID:
     Unique ID for the packet filter. The ID can be calculated under
     consideration of the ASSOCIATIONS and a local ID.
  
     SELECTOR_TYPE
     For filtering processes the SELECTOR TYPE defines what filtering
     type is used.
     Values: Matching | Hashing | Router_state
  
     SELECTOR_PARAMETERS
     For filtering processes the SELECTOR PARAMETERS define formally
     the common property of the packet being filtered. For the
     filters of type Matching and Hashing the definitions have a lot
     of points in common.
     Values:
     Case Matching
        - <Header type = ip v4 >
        - <bit specification, header part>
        - <Selection interval specification, header part>
        - <Header type = ipv6>
        - <bit specification, header part>
        - <Selection interval specification, header part>
        - <payload byte number N>
        - <bit specification, payload part>
        - <Selection interval specification, payload part>
  
     Notes to Case Matching:
  
        - The filter can be defined for the header part only, for the
           payload part only or for both. In the latter case the
           matching must be an AND of the two.
        - The bit specification, for the header part, can be
           specified for ipv4 or ipv6 only, or both
        - In case of ipv4, the bit specification is a sequence of 20
           Hexadecimal numbers [00,FF] specifying a 20 bytes bitmask
           to be applied to the header
        - In case of ipv6, it is a sequence of 40 Hexadecimal numbers
           [00,FF] specifying a 40 bytes bitmask to be applied to the
           header
        - The bit specification, for the payload part, is a sequence
           of Hexadecimal numbers [00,FF] specifying the bitmask to be
           applied to the first N bytes of the payload, as specified
           by the previous field. In case the Hexadecimal number
           sequence is longer then N, only the first N numbers are
           considered.
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 27]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
        - In case the payload is shorter than N, the packet will not
           match the filter Other options, like padding with zeros,
           may be considered in the future.
        - The selection interval specification is a list of non
           overlapping intervals [intv_begin, intv_end] where
           intv_begin, intv_end are bit strings of length 20*8 (ipv4
           case), 40*8 (ipv6 case), N*8 (payload case).
        - A filter cannot be defined on the options field of the ipv4
           header, neither on stacked headers of ipv6.
        - This specification doesnÆt preclude the future definition
           of a high level syntax for defining in a concise way bit
           selection and matching rules in a more human readable form
           (e.g. ôTCP port in [2000,3000]ö). The requirement is that
           such a syntax can be univoquely compiled into the one
           defined above
  
     Case Hashing:
        - <Header type = ipv4>
        - <Input bit specification, header part>
        - <Header type =  ipv6>
        - <Input bit specification, header part>
        - <payload byte number N>
        - <Input bit specification, payload part>
        - <additional hash input bits (seed)>
        - Hashing function specification
             - Hash function name
             - Length of input key (eliminate 0x bytes)
             - Output value (length M and bitmask)
             - Selection interval specification, as a list of non
               overlapping intervals [start value, end value] where
               value is in [0,2^M-1]
             - Additional parameters dependent on specific hash
               function
  
     Notes to Case Hashing:
  
        - On Input bit specifications fields, the same notes on bit
           specifications of the Matching case reported above apply
        - Some hash functions, their detailed functioning and their
           specific parameter list are described in [NiMD03].
  
     Case Router State:
  
        - Ingress interface at which the packet arrives equals a
           specified value
        - Egress interface to which the packet is routed equals a
           specified value
        - Packet violated Access Control List (ACL) on the router
        - Reverse Path Forwarding (RPF) failed for the packet
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 28]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
        - Resource Reservation is insufficient for the packet
        - No route found for the packet
        - Origin AS equals a specified value or lies within a given
           range
        - Destination AS equals a specified value or lies within a
           given range
  
     Note to Case Router State:
        - All Router state entries can be linked by AND, OR, NOT
           operators
  
     ASSOCIATIONS
     The ASSOCIATIONS field describes the observation point and
     (possibly) the IPFIX processes to which the packet selector is
     associated. The STREAM ID denotes the origin of the data stream
     that is input to the selection function. It can be the
     observation point directly or the ID of another selector. With
     this it is possible to define combined schemes. If the STREAM ID
     contains IDs from other selectors, one can derive the original
     observation point from the selector definitions of these
     specified selectors.
  
     Values: <STREAM ID, IPFIX Metering process ID, IPFIX Exporting
     process ID, IDs of other associated processes>
     With STREAM ID: Observation point ID AND List of SELECTOR_IDs
  
  
  6. Composite Techniques
  
     Composite schemes are realized by using the STREAM ID in the
     information models. The STREAM ID denotes from which selectors
     the input stream originates. If multiple stream IDs are given,
     this means that the selector operates on the packet stream
     simply resulting from the time superposition of the output of
     all the listed filters and samplers. Some examples of composite
     schemes are reported below.
  
  6.1 Cascaded filtering->sampling or sampling->filtering
  
     If a filter precedes a sampling process the role of filtering is
     to create a set of ôparent populationsö from a single stream
     that can then be fed independently to different sampling
     functions, with different parameters tuned for the population
     itself (e.g. if streams of different intensity result from
     filtering, it may be good to have different sampling rates). If
     filtering follows a sampling process, the same sampling rate and
     type is applied to the whole stream, independently of the
     relative size of the streams resulting from the filtering
     function. Moreover, also packets not destined to be selected in
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 29]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     the filtering operation will ôloadö the sampling function. So,
     in principle, filtering before sampling allows a more accurate
     tuning of the sampling procedure, but if filters are too complex
     to work at full line rate (e.g. because they have to access
     router state information), sampling before filtering may be a
     need.
  
  6.2 Stratified Sampling
  
     Stratified sampling is one example for using a composite
     technique. The basic idea behind stratified sampling is to
     increase the estimation accuracy by using a-priori information
     about correlations of the investigated characteristic with some
     other characteristic that is easier to obtain. The a-priori
     information is used to perform an intelligent grouping of the
     elements of the parent population. With this a higher estimation
     accuracy can be achieved with the same sample size or the sample
     size can be reduced without reducing the estimation accuracy.
  
     Stratified sampling divides the sampling process into multiple
     steps. First, the elements of the parent population are grouped
     into subsets in accordance to a given characteristic. This
     grouping can be done in multiple steps. Then samples are taken
     from each subset.
  
     The stronger the correlation between the characteristic used to
     divide the parent population (stratification variable) and the
     characteristic of interest (for which an estimate is sought
     after), the easier is the consecutive sampling process and the
     higher is the stratification gain. For instance if the dividing
     characteristic were equal to the investigated characteristic,
     each element of the sub-group would be a perfect representative
     of that characteristic. In this case it would be sufficient to
     take one arbitrary element out of each subgroup to get the
     actual distribution of the characteristic in the parent
     population. Therefore stratified sampling can reduce the costs
     for the sampling process (i.e. the number of samples needed to
     achieve a given level of confidence).
  
     For stratified sampling one has to specify classification rules
     for grouping the elements into subgroups and the sampling scheme
     that is used within the subgroups. The classification rules can
     be expressed by multiple filters. For the sampling scheme within
     the subgroups the parameters have to be specified as described
     above. The use of stratified sampling methods for measurement
     purposes is described for instance in [ClPB93] and [Zseb03].
  
  7. Security Considerations
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 30]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     Malicious users or attackers may be interested to hide packets
     from service providers or network operators. For instance if
     packet selectors are used for accounting or intrusion detection
     applications, users may want to prevent that packets are
     selected. If a deterministic sampling scheme is used or a
     selection scheme that takes packet content into account, the
     user can shape or send packets in a way that they are less
     likely to be selected (see also section 4.2.2). Even if the
     selection function is unknown to the user, some insight into the
     function can be obtained by performing experiments with
     different packet sequences. This has to be taken into account
     when choosing an appropriate packet selection technique.
  
     Further security threats can occur if the configuration of
     sampling parameters or the communication of sampling parameters
     to the application is corrupted. This document only describes
     sampling schemes that can be used for packet selection. It
     neither describes a mechanism how those parameters are
     configured nor how these parameters are communicated to the
     application. Therefore the security threats that originate from
     this kind of communication cannot be assessed with the
     information given in this document.
  
  
  8. References
  
     [AmCa89]    Paul D. Amer, Lillian N. Cassel: Management of
                  Sampled Real-Time Network Measurements, 14th
                  Conference on Local Computer Networks, October
                  1989, Minneapolis, pages 62-68, IEEE, 1989
  
     [ClPB93]    K.C. Claffy, George C Polyzos, Hans-Werner Braun:
                  Application of Sampling Methodologies to Network
                  Traffic Characterization, Proceedings of ACM
                  SIGCOMM'93, San Francisco, CA, USA, September 13 -
                  17, 1993
  
     [CoGi98]    I. Cozzani, S. Giordano: Traffic Sampling Methods
                  for end-to-end QoS Evaluation in Large
                  Heterogeneous Networks. Computer Networks and ISDN
                  Systems, 30 (16-18), September 1998.
  
     [Du04]      N.G. Duffield (Ed.), A Framework for Packet
                  Selection and Reporting, Internet Draft draft-ietf-
                  psamp-framework-05, work in progress, January 2004
  
     [DuGr00]    N.G. Duffield, M. Grossglauser: Trajectory Sampling
                  for Direct Traffic Observation, Proceedings of ACM
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 31]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
                  SIGCOMM 2000, Stockholm, Sweden, August 28 -
                  September 1, 2000.
  
     [DuGeGr02]   N.G. Duffield, A. Gerber, M. Grossglauser,
                  Trajectory Engine: A Backend for Trajectory
                  Sampling, IEEE Network Operations and Management
                  Symposium 2002, Florence, Italy, April 15-19, 2002.
  
  
     [DuLT01]    Nick Duffield, Carsten Lund, and Mikkel Thorup:
                  Charging from Sampled Network Usage, ACM Internet
                  Measurement Workshop IMW 2001, San Francisco, USA,
                  November 1-2, 2001
  
     [EsVa01]    C. Estan and G. Varghese, ôNew Directions in
                  Traffic Measurement and Accountingö, ACM SIGCOMM
                  Internet Measurement Workshop 2001, San Francisco
                  (CA) Nov. 2001
  
     [HT52]      D.G. Horvitz and D.J. Thompson, A Generalization of
                  Sampling   without replacement from a Finite
                  Universe. J. Amer. Statist. Assoc. Vol. 47, pp.
                  663-685, 1952.
  
     [ISO3309]    International Organization for Standardization,
                  "ISO Information Processing Systems - Data
                  Communication High-Level Data Link Control
                  Procedure - Frame Structure", IS 3309, October
                         rd
                  1984, 3  Edition.
  
  
     [Jenk97]    B. Jenkins: Algorithm Alley, Dr. Dobb's Journal,
                  September 1997.
                  http://burtleburtle.net/bob/hash/doobs.html
  
     [JePP92]    Jonathan Jedwab, Peter Phaal, Bob Pinna: Traffic
                  Estimation for the Largest Sources on a Network,
                  Using Packet Sampling with Limited Storage, HP
                  technical report, Managemenr, Mathematics and
                  Security Department, HP Laboratories, Bristol,
                  March 1992,
                  http://www.hpl.hp.com/techreports/92/HPL-92-35.html
  
     [Knuth98]   Donald E. Knuth: The Art of Computer Programming,
                  Volume 3: Searching and Sorting, Addison Wesley,
                  1998
  
  
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 32]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     [MolFl03]   M.Molina: Flow selection support in IPFIX, Internet
                  Draft <draft-molina-flow-selection-00.txt>, work in
                  progress, October 2003.
  
     [Moli03]    M.Molina: A scalable and efficient methodology for
                  flow monitoring in the internet, International
                  Teletraffic Congress (ITC-18), Berlin, Sep. 2003
  
     [NiMD03]    S. Niccolini, M.Molina, N.Duffield: Hash functions
                  description for packet selection, Internet Draft
                  <draft-niccolini-hash-descr-00.txt>, work in
                  progress, October 2003.
  
     [QuZC03]    J. Quittek, T. Zseby, B. Claise, S. Zander:
                  Requirements for IP Flow Information Export,
                  Internet Draft <draft-ietf-ipfix-reqs-13.txt>, work
                  in progress, December 2003
  
     [Zseb03]    T. Zseby: Stratification Strategies for Sampling-
                  based Non-intrusive Measurement of One-way Delay.
                  Passive and Active Measurement Workshop
                  Proceedings, La Jolla, CA, USA, pp. 171-179, Apr.
                  2003
  
  
  9. Author's Addresses
  
     Tanja Zseby
     Fraunhofer Institute for Open Communication Systems
     Kaiserin-Augusta-Allee 31
     10589 Berlin
     Germany
     Phone: +49-30-34 63 7153
     Fax:   +49-30-34 53 8153
     Email: zseby@fokus.fhg.de
  
     Maurizio Molina
     NEC Europe Ltd., Network Laboratories
     Adenauerplatz 6
     69115 Heidelberg
     Germany
     Phone: +49 6221 90511-18
     Email: molina@ccrle.nec.de
  
     Fredric Raspall
     NEC Europe Ltd., Network Laboratories
     Adenauerplatz 6
     69115 Heidelberg
     Germany
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 33]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
     Phone: +49 6221 90511-31
     EMail: raspall@ccrle.nec.de
  
     Nick Duffield
     AT&T Labs - Research
     Room B-139
     180 Park Ave
     Florham Park NJ 07932, USA
     Phone: +1 973-360-8726
     Email: duffield@research.att.com
  
  10.
      Intellectual Property Statement
  
     AT&T Corporation may own intellectual property applicable to
     this
     contribution. The IETF has been notified of AT&T's licensing
     intent for the specification contained in this document. See
     http://www.ietf.org/ietf/IPR/ATT-GENERAL.txt for AT&T's IPR
     statement.
  
  11.
      Full Copyright Statement
  
     Copyright (C) The Internet Society (2002). All Rights Reserved.
     This document and translations of it may be copied and furnished
     to others, and derivative works that comment on or otherwise
     explain it or assist in its implementation may be prepared,
     copied, published and distributed, in whole or in part, without
     restriction of any kind, provided that the above copyright
     notice and this paragraph are included on all such copies and
     derivative works. However, this document itself may not be
     modified in any way, such as by removing the copyright notice or
     references to the Internet Society or other Internet
     organizations, except as needed for the purpose of developing
     Internet standards in which case the procedures for copyrights
     defined in the Internet Standards process must be followed, or
     as required to translate it into languages other than
     English.
  
     The limited permissions granted above are perpetual and will not
     be revoked by the Internet Society or its successors or assigns.
  
     This document and the information contained herein is provided
     on an
     "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
     ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
     IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
     OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
     IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
     PARTICULAR PURPOSE.
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 34]


  Internet Draft  Techniques for IP Packet Selection    February 2004
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 35]