Internet Draft
Document: <draft-ietf-psamp-sample-tech-04.txt> T. Zseby
Expires: August 2004 Fraunhofer FOKUS
M. Molina
NEC Europe Ltd.
F. Raspall
NEC Europe Ltd.
N. Duffield
AT&T Labs - Research
February 2004
Sampling and Filtering Techniques for IP Packet Selection
Status of this Memo
This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working
documents as Internet-Drafts. Internet-Drafts are draft
documents valid for a maximum of six months and may be updated,
replaced, or obsoleted by other documents at any time. It is
inappropriate to use Internet-Drafts as reference material or to
cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This document describes sampling and filtering techniques for IP
packet selection. In two information models (one for sampling,
one for filtering) it defines what parameters are needed to
describe the most common selection schemes and shows how
techniques can be combined to build more elaborate packet
selectors. The information models are used for configuring the
selection technique in measurement processes and for reporting
the technique in use to a collector.
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 1]
Internet Draft Techniques for IP Packet Selection February 2004
Table of Contents
1. Introduction.................................................2
2. Terminology..................................................3
2.1 General Terminology..........................................3
2.2 Packet Selection Terminology.................................6
3. Scope and Deployment of Packet Selection Techniques..........8
3.1 Sampling....................................................10
3.1.1 Systematic Sampling........................................11
3.1.2 Random Sampling............................................12
3.1.2.1 n-out-of-N Sampling......................................12
3.1.2.2 Probabilistic Sampling...................................12
3.1.2.2.1 Uniform Probabilistic Sampling.........................12
3.1.2.2.2 Non-Uniform Probabilistic Sampling.....................12
3.1.2.2.3 Non-Uniform Flow State Dependent Sampling..............13
3.1.2.2.4 Configuration of non-uniform probabilistic and flow-
state sampling...................................................13
4. Filtering...................................................14
4.1 Mask/Match filtering........................................14
4.2 Hash-based Selection........................................15
4.2.1 Application Examples for Hash-based Selection..............16
4.2.1.1 Approximation of Random Sampling.........................16
4.2.1.2 Consistent Packet Selection..............................16
4.2.2 Guarding Against Pitfalls and Vulnerabilities..............17
4.2.3 Considerations and Recommendations for Hash-functions......18
4.2.4 IP Shift-XOR (IPSX) hash function..........................19
4.2.5 "Bob" hash function........................................20
4.3 Router State filtering......................................23
5. Input Parameters and Information Models.....................24
5.1 Information Model for Sampling Techniques....................25
5.2 Information Model for Filtering Techniques...................26
6. Composite Techniques........................................29
6.1 Cascaded filtering->sampling or sampling->filtering..........29
6.2 Stratified Sampling..........................................30
7. Security Considerations.....................................30
8. References..................................................31
9. Author's Addresses..........................................33
10. Intellectual Property Statement.............................34
11. Full Copyright Statement....................................34
1. Introduction
Increasing data rates and growing measurement demands increase
the requirements for data collection resources. High packet
rates in backbone networks load measurement processes. Demands
for fine granular results (e.g. per flow analysis) require
performant and flexible classification algorithms, which are
usually resource extensive. Furthermore, some measurement
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 2]
Internet Draft Techniques for IP Packet Selection February 2004
methods require the capturing of packet headers or even parts of
the payload. All this can lead to an overwhelming amount of
measurement data, resulting in high demands regarding resources
for measurement, storage, transport and post processing.
In some cases specialized hardware helps to fulfill these
demands but on the other hand increases the costs for providing
the measurement. Since measurements are mainly a supporting
functionality for the service provisioning, measurement costs
should be limited to a small fraction of the costs of the
network service provisioning itself. A reduction of the data
that is considered and reported by a measurement process is
crucial to prevent the depletion of the available (i.e. the
affordable) resources. Such a reduction can be achieved by a
reasonable deployment of packet selection techniques, that
sample a subset of the packets while still allowing an
appropriate accuracy, or filter out all packets that are not of
interest for the measurement at all. Packet selection helps to
prevent an exhaustion of resources and to limit the measurement
costs. Examples for applications that benefit from packet
selection are given in [Du04].
2. Terminology
The PSAMP terminology resulted from joint discussions of the
authors of this document together with the authors of [Du04].
Therefore all terms used throughout this document represent the
common understanding of the authors of both documents and are
consistent with those defined in [Du04]. Furthermore, it is
aimed at consistency with the terms used in [QuZC03].
2.1 General Terminology
* Observation Point: a location in the network where a packet
stream is observed. Examples include:
(i) a line to which a probe is attached;
(ii) a shared medium, such as an Ethernet-based LAN;
(iii) a single port of a router, or set of interfaces
(physical or logical) of a router;
(iv) an embedded measurement subsystem within an interface.
* Observed Packet Stream: the set of all packets observed at the
observation point.
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 3]
Internet Draft Techniques for IP Packet Selection February 2004
* Packet Stream: either the observed packet stream, or a subset
of it.
Note that packets selected from a stream, e.g. by sampling, do
not necessarily possess a property by which they can be
distinguished from packets that have not been selected. For
this reason the term ôstreamö is favored over ôflowö, which is
defined as set of packets with common properties [QuZC02].
* Selection Process: takes a packet stream as its input and
selects a subset of that stream as its output.
* Packet Content: the union of the packet header (which includes
link layer, network layer and other encapsulation headers) and
the packet payload.
* Selection State: a selection process may maintain state
information for use by the selection process and/or the
reporting process. At a given time, the selection state may
depend on packets observed at and before that time, and other
variables. Examples include:
(i) sequence numbers of packets at the input of selectors;
(ii) a timestamp of observation of the packet at the
observation points;
(iii) iterators for pseudorandom number generators;
(iv) hash values calculated during selection;
(v) indicators of whether the packet was selected by a
given selector;
Selection processes may change portions of the selection state
as a result of processing a packet.
* Selector: the component that performs a selection process on a
single packet of its input. A selected packet becomes an
element of the output packet stream of the selection process.
The selector can make use of the following information in
determining whether a packet is selected:
(i) the packetÆs content;
(ii) information derived from the packet's treatment at the
observation point;
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 4]
Internet Draft Techniques for IP Packet Selection February 2004
(iii) any selection state that may be maintained by the
selection process.
* Composite Selection Process: an ordered composition of
selection processes, in which the output stream issuing from
one component forms the input stream for the succeeding
component.
* Primitive Selection Process: a selection process that is not a
composite selection process.
* Composite Selector: the selector of a composite selection
process.
* Primitive Selector: the selector of a primitive selection
process.
* Reporting Process: creates a report stream on packets selected
by a selection process, in preparation for export. The input
to the reporting process comprises that information available
to the selection process per selected packet, specifically:
(i) the selected packetÆs content or parts of it;
(ii) information derived from the selected packet's
treatment at the observation point;
(iii) any selection state maintained by the inputting
selection process, reflecting any modifications to the
selection state made during selection of the packet.
* Report Stream: the output of a reporting process is a report
stream, comprising two distinguished types of information:
packet reports, and report interpretation.
* Packet Reports: a configurable subset of the per packet input
to the reporting process.
* Report Interpretation: subsidiary information relating to one
or more packets, that is used for interpretation of their
packet reports. Examples include configuration parameters of
the selection process and of the reporting process.
* Measurement Process: the composition of a selection process
that takes the observed packet stream as its input, followed
by a reporting process.
* Export Process: sends the output of one or more reporting
processes to one or more collectors.
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 5]
Internet Draft Techniques for IP Packet Selection February 2004
* Collector: a collector receives a report stream exported by
one or more export processes. In some cases, the host of the
measurement and/or export processes may also serve as the
collector.
* Export packets: one or more packet reports, and perhaps report
interpretation, are bundled by the export process into a
export packet for export to a collector.
Various possibilities for the high level architecture of these
elements are as follows.
MP = Measurement Process, EP = Export Process
+---------------------+ +------------------+
|Observation Point(s) | | Collector(1) |
|MP(s)--->EP----------+---------------->| |
|MP(s)--->EP----------+-------+-------->| |
+---------------------+ | +------------------+
|
+---------------------+ | +------------------+
|Observation Point(s) | +-------->| Collector(2) |
|MP(s)--->EP----------+---------------->| |
+---------------------+ +------------------+
+---------------------+
|Observation Point(s) |
|MP(s)--->EP---+ |
| | |
|Collector(3)<-+ |
+---------------------+
The PSAMP measurement process can be viewed as analogous to the
IPFIX metering process. The PSAMP measurement process takes an
observed packet stream as its input, and produces packet reports
as its output. The IPFIX metering process produces flow records
as its output. The distinct name ômeasurement processö has been
retained in order to avoid potential confusion in settings where
IPFIX and PSAMP coexist, and in order to avoid the implicit
requirement that the PSAMP version satisfy the requirements of
an IPFIX metering process (at least while these are under
development). The relation between PSAMP and IPFIX is further
discussed in [QC03].
2.2 Packet Selection Terminology.
* Filtering: a filter is a selection operation that selects a
packet deterministically based on the packet content, its
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 6]
Internet Draft Techniques for IP Packet Selection February 2004
treatment, and functions of these occurring in the selection
state. Examples include mask/match filtering, and hash-based
selection.
* Sampling: a selection operation that is not a filter is called
a sampling operation. This reflects the intuitive notion that
if the selection of a packet cannot be determined from its
content alone, there must be some type of sampling taking
place.
* Content-independent Sampling: a sampling operation that does
not use packet content (or quantities derived from it) as the
basis for selection is called a content-independent sampling
operation. Examples include systematic sampling, and uniform
pseudorandom sampling driven by a pseudorandom number whose
generation is independent of packet content. Note that in
content-independent sampling it is not necessary to access the
packet content in order to make the selection decision.
* Content-dependent Sampling: a sampling operation where
selection is dependent on packet content is called a content-
dependent sampling operation. Examples include pseudorandom
selection according to a probability that depends on the
contents of a packet field. Note that this is not a filter ,
because the selection is not deterministic..
* Hash Domain: a subset of the packet content and the packet
treatment, viewed as an N-bit string for some positive integer
N.
* Hash Range: a set of M-bit strings for some positive integer
M.
* Hash Function: a deterministic map from the hash domain into
the hash range.
* Hash Selection Range: a subset of the hash range. The packet
is selected if the action of the hash function on the hash
domain for the packet yields a result in the hash selection
range.
* Hash-based Selection: filtering specified by a hash domain, a
hash function, and hash range and a hash selection range.
* Approximative Selection: selection operations in any of the
above categories may be approximated by operations in the same
or another category for the purposes of implementation. For
example, uniform pseudorandom sampling may be approximated by
hash-based selection, using a suitable hash function and hash
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 7]
Internet Draft Techniques for IP Packet Selection February 2004
domain. In this case, the closeness of the approximation
depends on the choice of hash function and hash domain.
* Population size: the number of all packets in the packet
stream (or subset) for which a metric should be estimated
* Sample size: the number of packets selected from the
population by a selection operation.
* Attained Selection Frequency: the actual frequency with which
packets are selected by a selection process. The attained
sampling frequency is calculated as ratio of the size of a
sample size to the size of the population from which it was
selected.
* Target Selection Frequency: the long-term frequency with which
packets are expected to be selected, based on selector
parameter settings. Depending on the selector, the target
selection frequency may be count-based or time-based.
Due to the inherent statistical variability of sampling
decisions, the target and attained selection frequencies can
differ (e.g. for probabilistic sampling and hash-based
selection). Nevertheless, for large population sizes and
properly configured sampling schemes the attained selection
frequency usually approaches the target selection frequency.
In hash-based selection, the target selection frequency is the
quotient of size of the hash selection range by the size of
the hash range.
3. Scope and Deployment of Packet Selection Techniques
Packet selection techniques generate a subset of packets from an
Observed Packet Stream at an observation point. We distinguish
between sampling and filtering.
Sampling is targeted at the selection of a representative subset
of packets. The subset is used to infer knowledge about the
whole set of observed packets without processing them all. The
selection can depend on packet position, and/or on packet
content, and/or on (pseudo) random decisions.
Filtering selects a subset with common properties. This is used
if only a subset of packets is of interest. The properties can
be directly derived from the packet content, or depend on the
treatment given by the router to the packet. Filtering is a
deterministic operation. It depends on packet content or router
treatment. It never depends on packet position or on (pseudo)
random decisions.
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 8]
Internet Draft Techniques for IP Packet Selection February 2004
Note that a common technique to select packets is to compute a
Hash Function on some bits of the packet header and/or content
and to select it if the Hash Value falls in the Hash Selection
Range. Since hashing is a deterministic operation on the packet
content, it is a filtering technique according to our
categorization. Nevertheless, hash functions are sometimes used
to approximate random sampling. Depending on the chosen input
bits, the Hash Function and the Hash Selection Range, this
technique can be used to approximate the random selection of
packets with a given probability p. It is also a powerful
technique to consistently select the same packet subset at
multiple observation points [DuGr00]
The following table gives an overview of the schemes described
in this document and their categorization. An X in brackets (X)
denotes schemes for which also content-independent variants
exist. It easily can be seen that only schemes with both
properties, content dependence and deterministic selection, are
considered as filters.
Selection Scheme | deterministic | content- | Category
| selection | dependent|
------------------------+---------------+----------+----------
systematic | X | _ | Sampling
count-based | | |
------------------------+---------------+----------+----------
systematic | X | - | Sampling
time-based | | |
------------------------+---------------+----------+----------
random | - | - | Sampling
n-out-of-N | | |
------------------------+---------------+----------+----------
random | - | - | Sampling
uniform probabilistic | | |
------------------------+---------------+----------+----------
random | - | (X) | Sampling
non-uniform probabil. | | |
------------------------+---------------+----------+----------
random | - | (X) | Sampling
non-uniform flow-state | | |
------------------------+---------------+----------+----------
mask/match filter | X | X | Filter
------------------------+---------------+----------+----------
hash function | X | X | Filter
------------------------+---------------+----------+----------
router state filter | X | (X) | Filter
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 9]
Internet Draft Techniques for IP Packet Selection February 2004
------------------------+---------------+----------+----------
The introduced categorization is mainly useful for the
definition of an information model describing Primitive
Selectors . More complex selection techniques can be described
through the composition of cascaded sampling and filtering
operations.
For example, a packet selection that weights the selection
probability on the basis of the packet length can be described
as a cascade of a filter and a sampling scheme. However, this
descriptive approach is not intended to be rigid: if a common
and consolidated selection practice turns out to be too complex
to be described as a composition of the mentioned building
blocks, an ad hoc description can be specified instead and added
as a new scheme to the information model.
Packet selectors can be part of an IPFIX metering process and
can also use the IPFIX exporting process. This is expressed as
association to one or more IPFIX processes.
3.1 Sampling
The deployment of sampling techniques aims at the provisioning
of information about a specific characteristic of the parent
population at a lower cost than a full census would demand. In
order to plan a suitable sampling strategy it is therefore
crucial to determine the needed type of information and the
desired degree of accuracy in advance.
First of all it is important to know the type of metric that
should be estimated. The metric of interest can range from
simple packet counts [JePP92] up to the estimation of whole
distributions of flow characteristics (e.g. packet
sizes)[ClPB93].
Secondly, the required accuracy of the information and with
this, the confidence that is aimed at, should be known in
advance. For instance for usage-based accounting the required
confidence for the estimation of packet counters can depend on
the monetary value that corresponds to the transfer of one
packet. That means that a higher confidence could be required
for expensive packet flows (e.g. premium IP service) than for
cheaper flows (e.g. best effort). The accuracy requirements for
validating a previously agreed quality can also vary extremely
with the customer demands. These requirements are usually
determined by the service level agreement (SLA).
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 10]
Internet Draft Techniques for IP Packet Selection February 2004
The sampling method and the parameters in use must be clearly
communicated to all applications that use the measurement data.
Only with this knowledge a correct interpretation of the
measurement results can be ensured.
Sampling methods can be characterized by the sampling algorithm,
the trigger type used for starting a sampling interval and the
length of the sampling interval. These parameters are described
here in detail. The sampling algorithm describes the basic
process for selection of samples. In accordance to [AmCa89] and
[ClPB93] we define the following basic sampling processes:
3.1.1 Systematic Sampling
Systematic sampling describes the process of selecting the start
points and the duration of the selection intervals according to
a deterministic function. This can be for instance the periodic
selection of every k-th element of a trace but also the
selection of all packets that arrive at pre-defined points in
time. Even if the selection process does not follow a periodic
function (e.g. if the time between the sampling intervals varies
over time) we consider this as systematic sampling as long as
the selection is deterministic.
The use of systematic sampling always involves the risk of
biasing the results. If the systematics in the sampling process
resemble systematics in the observed stochastic process
(occurrence of the characteristic of interest in the network),
there is a high probability that the estimation will be biased.
Systematics in the observed process might not be known in
advance.
Here only equally spaced schemes are considered, where triggers
for sampling are periodic, either in time or in packet count.
All packets occurring in a selection interval (either in time or
packet count) beyond the trigger are selected.
Systematic count-based
In systematic count-based sampling the start and stop triggers
for the sampling interval are defined in accordance to the
spatial packet position (packet count).
Systematic time-based
In systematic count-based sampling the start and stop triggers
for the sampling interval are defined in accordance to the
temporal packet position (arrival time).
Both schemes are contentûindependent selection schemes. Content
dependent deterministic selectors are categorized as filter.
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 11]
Internet Draft Techniques for IP Packet Selection February 2004
3.1.2 Random Sampling
Random sampling selects the starting points of the sampling
intervals in accordance to a random process. The selection of
elements are independent experiments. With this, unbiased
estimations can be achieved. In contrast to systematic sampling,
random sampling requires the generation of random numbers. One
can differentiate two methods of random sampling:
3.1.2.1 n-out-of-N Sampling
In n-out-of-N sampling n elements are selected out of the parent
population that consists of N elements. One example would be to
generate n different random numbers in the range [1,N] and
select all packets which have a packet position equal to one of
the random numbers. For this kind of sampling the sample size n
is fixed.
3.1.2.2 Probabilistic Sampling
In probabilistic sampling the decision whether an element is
selected or not is made in accordance to a pre-defined selection
probability. An example would be to flip a coin for each packet
and select all packets for which the coin showed the head. For
this kind of sampling the sample size can vary for different
trials. The selection probability does not necessarily has to be
the same for each packet. Therefore we distinguish between
uniform probabilistic sampling (with the same selection
probability for all packets) and non-uniform probabilistic
sampling (where the selection probability can vary for different
packets).
3.1.2.2.1 Uniform Probabilistic Sampling
For Uniform Probabilistic Sampling packets are selected
independently with a uniform probability p. This sampling can be
count-driven, and is sometimes referred to as geometric random
sampling, since the difference in count between successive
selected packets are independent random variables with a
geometric distribution of mean 1/p. A time-driven analog,
exponential random sampling, has the time between triggers
exponentially distributed.
Both geometric and exponential random sampling are examples of
what is known as additive random sampling, defined as sampling
where the intervals or counts between successive samples are
independent identically distributed random variable.
3.1.2.2.2 Non-Uniform Probabilistic Sampling
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 12]
Internet Draft Techniques for IP Packet Selection February 2004
This is a variant of Probabilistic Sampling in which the
sampling probabilities can depend on the selection process
input. This can be used to weight sampling probabilities in
order e.g. to boost the chance of sampling packets that are rare
but are deemed important. Unbiased estimators for quantitative
statistics are recovered by renormalization of sample values;
see [HT52].
3.1.2.2.3 Non-Uniform Flow State Dependent Sampling
Another type of sampling that can be classified as probabilistic
Non-Uniform is closely related to the flow concept as defined
in [QuZC02], and it is only used jointly with a flow monitoring
function (IPFIX monitoring function). Packets are selected,
dependent on a selection state. The point, here, is that the
selection state is determined also by the state of the flow the
packet belongs to and/or by the state of the other flows
currently being monitored by the associated flow monitoring
function. An example for such an algorithm is the ösample and
holdö method described in [EsVa01]:
- If a packet accounts for a flow record that already exists in
the IPFIX flow recording process, it is selected (i.e. the
flow record is updated)
- If a packet doesn't account to any existing flow record, it is
selected with probability p. If it has been selected a new
flow record has to be created.
A further algorithm that fits into the category of non-uniform
flow state dependent sampling is described in [Moli03].
This type of sampling is content dependent because the
identification of the flow the packet belongs to requires
analyzing part of the packet content. If the packet is selected,
then it is passed as an input to the IPFIX monitoring function
(this is called öLocal Exportö in [Du04]
Selecting the packet depending on the state of a flow cache is
useful when memory resources of the flow monitoring function are
scarce (i.e. thereÆs no room to keep all the flows that have
been scheduled for monitoring). See [MolFl03] for a more
detailed description of the motivations for this type of
sampling and the impact on the IPFIX metering.
3.1.2.2.4 Configuration of non-uniform probabilistic and
flow-state sampling
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 13]
Internet Draft Techniques for IP Packet Selection February 2004
Many different specific methods can be grouped under the terms
non-uniform probabilistic and flow state sampling. Dependent on
the sampling goal and the implemented scheme, a different number
and type of input parameters is required to configure such
scheme.
Some concrete proposals for such methods exist from the research
community (e.g. [EsVa01],[DuLT01],[Moli03]). All of these
proposals are still in an early stage and need further
investigations to prove their usefulness and applicability.
Since we donÆt want give preference to one of the existing early
stage methods we here only describe the basic methods and leave
the specification of explicit schemes and their parameters up to
vendors (e.g. as extension of the information model).
4. Filtering
Filtering is the deterministic selection of packets based on the
packet content, the treatment of the packet at the observation
point, or deterministic functions of these occurring in the
selection state. The packet is selected if these quantities fall
into a specified range.
The role of filtering, as the word itself suggest, is to
separate all the packets having a certain property from those
not having it. A distinguishing characteristic from sampling is
that the property never depends on the packet position in time
or in the space, or on a random process.
We identify and describe in the following three filtering
techniques. The first two (Mask/Match filtering and Hashing
filtering) are stateless, and therefore can make their decision
based on the analysis of portion of the packet only. The other
(router state filtering) requires to access state information
after the analysis of part of the packet and is therefore more
complex: its usage makes sense only in particular circumstances,
as described below.
4.1 Mask/Match filtering
This type of filtering selects a packet operating as follows:
A number of bit positions are chosen in accordance to the
filtering goal. A mask is applied with a logical AND to the
incoming packet to keep only the chosen bits. The result of this
operation is then compared to a predefined single value (e.g. a
specific source IP address), a set of values or a range. The
packet is selected in accordance to the result of this
comparison.
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 14]
Internet Draft Techniques for IP Packet Selection February 2004
The selected bits of the packet arenÆt necessarily only those of
the IP header. If bits of the IP header and bits of the payload
are considered, the masks and the selection intervals MUST be
specified separately for the header and the payload.
An implementation is not constrained to apply exactly all the
steps or in this sequence, provided that the result is
equivalent to a logical function doing it.
Examples of filters of this class are filters that select
packets based on the matching of some of the IP header fields
with a (possibly masked) specific value, filters that select
packets having some IP header fields values falling within a
range, filters that do the same as above on some of the
transport header fields (that are thus considered as part of the
payload), or a combination of all of the above mentioned
possibilities.
4.2 Hash-based Selection
A Hash Function h maps the packet content c, or some portion of
it, onto a Hash Range R. The packet is selected if h(c) is an
element of S, which is a subset of R called the Hash Selection
Range. Thus hash-based sampling is indeed a particular case of
filtering: the object is selected if c is in inv(h(S)). But for
desirable Hash Functions the inverse image inv(h(S)) will be
extremely complex, and hence h would not be expressible as, say,
a match/mask filter or a simple combination of these.
Hash-based selection has mainly two types of usage: it offers a
way to approximate random sampling by using packet content to
generate pseudorandom variates and a way to consistently select
subsets of packets that share a common property (e.g. at
different observation points).
In the following subsections we give more details about them.
However, both usages require that the Hash Functions has two
statistical properties. First, the hash function h must have
good mixing properties, in the sense that small changes in the
input (e.g. the flipping of a single bit) cause large changes in
the output (many bits change). Then any local clump of values of
c is spread widely over R by h, and so the distribution of h(c)
is fairly uniform even if the distribution of c is not. Then the
sampling rate is #S/#R, which can be tuned by choice of S. If S
and R are sets contiguous integers, h(c), suitably shifted and
normalized, can be interpreted as a pseudorandom variate. The
second desirable property depends more closely on the statistics
of the content c. In applications, the content c comprises a
number of distinct fields, c1 ... cm, e.g. source and
destination IP Address, IP identification, and TCP/UDP port
numbers (if present) for a packet. With a hash function
satisfying the first properties above, selection decisions will
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 15]
Internet Draft Techniques for IP Packet Selection February 2004
appear uncorrelated with the contents of any individual field,
if the complementary fields are (i) sufficiently variable
themselves, and (ii) sufficiently uncorrelated with cj.
4.2.1 Application Examples for Hash-based Selection
4.2.1.1 Approximation of Random Sampling
Although pseudorandom number generators with well understood
properties have been developed, they may not be the method of
choice in setting where computational resources are scarce. A
convenient alternative is to use hash functions of packet
content as a source of randomness. The hash (suitably
renormalized) is a pseudorandom variate in the interval [0,1].
Other schemes may use packet fields in iterators for
pseudorandom numbers.
The point here, is that the statistical properties of the
idealized packet selection law (such as independence of sampling
decisions for different packets, or independence on packet
content) may not be exactly shared by an implementation, but
only approximately so.
Although the selection decisions for non-uniform probabilistic
sampling (see Section 3.1.2.2.2 above) also depend on the packet
content, this form of sampling is distinguished from the use of
packet content to generate variates. In the former case, the
content only determines the selection probabilities: selection
could then proceed e.g by use of a variates obtained by an
independent pseudorandom number generator. In the latter case,
the content determines the pseudorandom variates rather than the
probabilities.
4.2.1.2 Consistent Packet Selection
In Consistent packet selection, all routers in a network hash
parts of the packet content using identical hash function and
selection range. The domain of the hash is restricted to those
parts of the packet that are invariant from hop to hop. Fields
such as Time-to-Live, which is decremented per hop, and header
CRC, which is recalculated per hop, are thus excluded from the
hash domain. Thus, a given packet is selected at either all
points on its path through the network, or at none. The domain
of the hash function needs to be wider than just a flow key, if
packets are to be selected quasi-randomly within flows (and e.g.
include portions of the payload), see [DuGr00].
If a report on each selected packet is exported to a collector,
the collector can reconstruct trajectories of the selected
packets, provided it can match different reports on the same
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 16]
Internet Draft Techniques for IP Packet Selection February 2004
packet, and distinguish these from reports on other packets. For
this purpose a second distinct hash function may be used to
generate a label or digest of each selected packet for inclusion
in its packet report. The benefit of using a digest is reduction
in reporting bandwidth. Routers need only report the digest
provided at least one router (say an edge router) sends a full
packet report containing packet fields of interest in addition
to the digest. A digest size of 32 bits has been found
sufficient to distinguish packets; see [DuGeGr02].
Applications of consistent packet selection include (i)
estimation of the network path matrix, i.e., the traffic
intensities according to network path, broken down by flow; (ii)
detection of routing loops, as indicated by self-intersecting
trajectories; (iii) passive performance measurement: prematurely
terminating trajectories indicate packet loss, packet one way
delay can be determined if reports include (synchronized)
timestamps; (iv) network attack tracing, through determination
of the actual paths taken by attack packets with spoofed source
addresses.
4.2.2 Guarding Against Pitfalls and Vulnerabilities
A concern is whether some large set of related packets could be
sampled at a rate that significantly differs from the expected
sampling rate, either (i) through unanticipated behavior in the
hash function, or (ii) because the packets had been deliberately
crafted to have this property.
The first point underlines the importance of using a hash
function with good mixing properties. Examples of such are CRC32
and hash functions based on modular arithmetic, see 6.4 in
[Knuth98]. The statistical properties of candidate hash
functions need to be evaluated, preferably on packet before
adoption for hash-based sampling
Hash sampling could be overloaded (or evaded) by an attacker if
the hash function and the selection rate are both known. A
service provider could build a first defense keeping the Hash
Selection Range S private. Then, an attacker could not determine
whether a crafted packet is selected, but he would still know
that a crafted set of packets all with the same hash is either
all selected or all not selected. Moreover, when applications
(like multi domain trajectory sampling, or one way delay
estimation across multiple domains) may require multiple
administrative entities to agree on a common hash function and
selection range, mutual trust between the entities cannot be
avoided and just keeping S secret may not be feasible. A
stronger defense is to employ a parametrizable hash function and
keep the parameter private: in this case, the set of hash values
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 17]
Internet Draft Techniques for IP Packet Selection February 2004
of the packets could not be determined. Examples of parameters
are the initial vector in CRC32, and moduli in hashes based on
modular arithmetic.
4.2.3 Considerations and Recommendations for Hash
Functions.
For hash-based selection, the most important requirements for a
hash function are simplicity of implementation and speed of
operation, followed by uniformity of hash distribution.
Simplicity promotes wider implementation and the ability to
operate at line speed. Uniformity of hash promotes the
approximation of random sampling by hash-based selection.
For a packet digesting hash, uniformity of hash distribution and
a small collision probability are the most important
requirements. Since the digest need only be computed over the
substream of selected packets, a digesting hash does not have
the same speed requirements as a sampling hash. Thus digesting
enables and benefits from the use of more complex hash functions
than hash-based selection.
The properties of a number of different hash functions were
evaluated and compared. On this basis, the following
recommendations are made for PSAMP hash functions.
When hash-based packet selection is supported, the following
hash functions MUST be offered:
* The IPSX hash function; see Section 4.2.4 below. IPSX is a
fast hash function with good uniformity properties, and is
intended for hash-based selection. It operates with fixed
length input tailored for IPv4 packets, and produces a 16 bit
output, enabling sampling down to rates of 1 in 65536.
* The CRC32 hash function; see [ISO3309]. CRC32 has small
collision probabilities, its 32 bit output being sufficiently
large to function as a digest. Its use as a selection hash is
not precluded, however it is roughly an order of magnitude
slower than IPSX and so is not expected to function as widely
for this purpose. It can accept a variable length input and so
provides a potential future path for hash-based selection of
packets of protocols other than IPv4.
Other hash functions MAY be provided. A candidate hash function
is the BOB hash function described in Section 4.2.5 below. Its
performance in speed, uniformity and collisions are comparable
with (and slightly superior to) CRC32. As remarked in Section 9
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 18]
Internet Draft Techniques for IP Packet Selection February 2004
of [Du04], the MD5 hash function may be precluded for ubiquitous
use at full line rate, at least for hash-based selection.
4.2.4 IP Shift-XOR (IPSX) hash function
The IPSX hash function is tailored for acting on IP version 4
packets. It exploits the structure of IP packet and in
particular the variability expected to be exibited within
different fields of the IP packet in order to furnish a hash
value with little apparent correlation with individual packet
fields. Fields from the IPv4 and TCP/UDP headers are used as
input. The IPSX hash function uses a small number of simple
instructions.
Input parameters: None
Built-in parameters: None
Output: The output of the IPSX is a 16 bit number
Functioning:
The functioning can be divided into two parts: input selection,
which forms are composite input from various portions of the IP
packet, followed by computation of the hash on the composite.
Input Selection:
The raw input is drawn from the first 20 bytes of the IP packet
header and the first 8 bytes of the IP payload. If IP options
are not used, the IP header has 20 bytes, and hence the two
portions adjoin and comprise the first 28 bytes of the IP
packet. We now use the raw input as 4 32-bit subportions of
these 28 bytes. We specify the input by bit offsets from the
start of IP header or payload.
f1 = bits 32 to 63 of the IP header, comprising the IP
identification field, flags, and fragment offset.
f2 = bits 96 to 127 of the IP header, the source IP address.
f3 = bits 128 to 159 of the IP header, the destination IP
address.
f4 = bits 32 to 63 of the IP payload. For a TCP packet, f4
comprises the TCP sequence number followed by the message
length. For a UDP packet f4 comprises the UDP checksum.
Hash Computation:
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 19]
Internet Draft Techniques for IP Packet Selection February 2004
The hash is computed from f1, f2, f3 and f4 by a combination of
XOR (^), right shift (>>) and left shift (<<) operations. The
intermediate quantities h1, v1, v2 are 32-bit numbers.
1. v1 = f1 ^ f2;
2. v2 = f3 ^ f4;
3. h1 = v1 << 8;
4. h1 ^= v1 >> 4;
5. h1 ^= v1 >> 12;
6. h1 ^= v1 >> 16;
7. h1 ^= v2 << 6;
8. h1 ^= v2 << 10;
9. h1 ^= v2 << 14;
10. h1 ^= v2 >> 7;
The output of the hash is the least significant 16 bits of h1.
4.2.5 "Bob" hash function
"Bob" hash function is a hash function designed for having each
bit of the input affecting every bit of the return value and
using both 1-bit and 2-bit deltas to achieve the so called
avalanche effect [Jenk97]. This function was originally built
for hash table lookup with fast software implementation.
Input Parameters:
The input parameters of such a function are:
- the length of the input string (key) to be hashed, in
bytes. The elementary input blocks of Bob hash are the single
bytes, therefore no padding is needed.
- an init value (an arbitrary 32-bit number).
Built in parameters:
The Bob Hash uses the following built-in parameter:
- the golden ratio (an arbitrary 32-bit number used in the hash
function computation: its purpose is to avoid mapping all zeros
to all zeros);
Note: the mix sub-function (see mix (a,b,c) macro in the
reference code in 3.2.4) has a number of parameters governing
the shifts in the registers. The one presented is not the only
possible choice.
It is an open point whether these may be considered additional
built-in parameters to specify at function configuration.
Output.
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 20]
Internet Draft Techniques for IP Packet Selection February 2004
The output of the BOB function is a 32-bit number. It should be
specified:
- A 32 bit mask to apply to the output
- The selection range as a list of non overlapping intervals
[start value, end value] where value is in [0,2^32]
Functioning:
The hash value is obtained computing first an initialization of
an internal state (composed of 3 32-bit numbers, called a, b, c
in the reference code below), then, for each input byte of the
key the internal state is combined by addition and mixed using
the mix sub-function. Finally, the internal state mixed one last
time and the third number of the state (c) is chosen as the
return value.
typedef unsigned long int ub4; /* unsigned 4-byte
quantities */
typedef unsigned char ub1; /* unsigned 1-byte
quantities */
#define hashsize(n) ((ub4)1<<(n))
#define hashmask(n) (hashsize(n)-1)
/* ------------------------------------------------------
mix -- mix 3 32-bit values reversibly.
For every delta with one or two bits set, and the deltas of
all three high bits or all three low bits, whether the original
value of a,b,c is almost all zero or is uniformly distributed,
* If mix() is run forward or backward, at least 32 bits in
a,b,c have at least 1/4 probability of changing.
* If mix() is run forward, every bit of c will change between
1/3 and 2/3 of the time. (Well, 22/100 and 78/100 for some 2-
bit deltas.) mix() was built out of 36 single-cycle latency
instructions in a structure that could supported 2x parallelism,
like so:
a -= b;
a -= c; x = (c>>13);
b -= c; a ^= x;
b -= a; x = (a<<8);
c -= a; b ^= x;
c -= b; x = (b>>13);
...
Unfortunately, superscalar Pentiums and Sparcs can't take
advantage of that parallelism. They've also turned some of
those single-cycle latency instructions into multi-cycle latency
instructions
------------------------------------------------------------*/
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 21]
Internet Draft Techniques for IP Packet Selection February 2004
#define mix(a,b,c) \
{ \
a -= b; a -= c; a ^= (c>>13); \
b -= c; b -= a; b ^= (a<<8); \
c -= a; c -= b; c ^= (b>>13); \
a -= b; a -= c; a ^= (c>>12); \
b -= c; b -= a; b ^= (a<<16); \
c -= a; c -= b; c ^= (b>>5); \
a -= b; a -= c; a ^= (c>>3); \
b -= c; b -= a; b ^= (a<<10); \
c -= a; c -= b; c ^= (b>>15); \
}
/* -----------------------------------------------------------
hash() -- hash a variable-length key into a 32-bit value
k : the key (the unaligned variable-length array of bytes)
len : the length of the key, counting by bytes
initval : can be any 4-byte value
Returns a 32-bit value. Every bit of the key affects every bit
of the return value. Every 1-bit and 2-bit delta achieves
avalanche. About 6*len+35 instructions.
The best hash table sizes are powers of 2. There is no need to
do mod a prime (mod is sooo slow!). If you need less than 32
bits, use a bitmask. For example, if you need only 10 bits, do
h = (h & hashmask(10));
In which case, the hash table should have hashsize(10) elements.
If you are hashing n strings (ub1 **)k, do it like this:
for (i=0, h=0; i<n; ++i) h = hash( k[i], len[i], h);
By Bob Jenkins, 1996. bob_jenkins@burtleburtle.net. You may
use this code any way you wish, private, educational, or
commercial. It's free.
See http://burtleburtle.net/bob/hash/evahash.html
Use for hash table lookup, or anything where one collision in
2^^32 is acceptable. Do NOT use for cryptographic purposes.
----------------------------------------------------------- */
ub4 bob_hash(k, length, initval)
register ub1 *k; /* the key */
register ub4 length; /* the length of the key */
register ub4 initval; /* an arbitrary value */
{
register ub4 a,b,c,len;
/* Set up the internal state */
len = length;
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 22]
Internet Draft Techniques for IP Packet Selection February 2004
a = b = 0x9e3779b9; /*the golden ratio; an arbitrary value
*/
c = initval; /* another arbitrary value */
/*------------------------------------ handle most of the key */
while (len >= 12)
{
a += (k[0] +((ub4)k[1]<<8) +((ub4)k[2]<<16)
+((ub4)k[3]<<24));
b += (k[4] +((ub4)k[5]<<8) +((ub4)k[6]<<16)
+((ub4)k[7]<<24));
c += (k[8] +((ub4)k[9]<<8)
+((ub4)k[10]<<16)+((ub4)k[11]<<24));
mix(a,b,c);
k += 12; len -= 12;
}
/*---------------------------- handle the last 11 bytes */
c += length;
switch(len) /* all the case statements fall through*/
{
case 11: c+=((ub4)k[10]<<24);
case 10: c+=((ub4)k[9]<<16);
case 9 : c+=((ub4)k[8]<<8);
/* the first byte of c is reserved for the length */
case 8 : b+=((ub4)k[7]<<24);
case 7 : b+=((ub4)k[6]<<16);
case 6 : b+=((ub4)k[5]<<8);
case 5 : b+=k[4];
case 4 : a+=((ub4)k[3]<<24);
case 3 : a+=((ub4)k[2]<<16);
case 2 : a+=((ub4)k[1]<<8);
case 1 : a+=k[0];
/* case 0: nothing left to add */
}
mix(a,b,c);
/*-------------------------------- report the result */
return c;
}
4.3 Router State filtering
This class of filters select a packet on the basis of router
state conditions. The following list gives examples for such
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 23]
Internet Draft Techniques for IP Packet Selection February 2004
conditions. Conditions can be combined with AND, OR or NOT
operators.
- Ingress interface at which the packet arrives equals a
specified value
- Egress interface to which the packet is routed equals a
specified value
- Packet violated Access Control List (ACL) on the router
- Reverse Path Forwarding (RPF) failed for the packet
- Resource Reservation is insufficient for the packet
- No route found for the packet
- Origin AS equals a specified value or lies within a given
range
- Destination AS equals a specified value or lies within a
given range
Router architectural considerations may preclude some
information concerning the packet treatment, e.g. routing state,
being available at line rate for selection of packets. However,
if selection not based on routing state has reduced down from
line rate, subselection based on routing state may be feasible.
5. Input Parameters and Information Models
This section gives an overview of different alternative
selection schemes and their required parameters. In order to be
compliant with PSAMP it is sufficient to implement one of the
proposed schemes.
The decision whether to select a packet or not is based on a
function which is performed when the packet arrives at the
selection process. Packet selection schemes differ in the input
parameters for the selection process and the functions they
require to do the packet selection. The following table gives an
overview.
Scheme | input parameters | functions
---------------+------------------------+-------------------
systematic | packet position | packet counter
count-based | sampling pattern |
---------------+------------------------+-------------------
systematic | arrival time | clock or timer
time-based | sampling pattern |
---------------+------------------------+-------------------
random | packet position | packet counter,
n-out-of-N | sampling pattern | random numbers
| (random number list) |
---------------+------------------------+-------------------
uniform | sampling | random function
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 24]
Internet Draft Techniques for IP Packet Selection February 2004
probabilistic | probability |
---------------+------------------------+-------------------
non-uniform |e.g. packet position, | selection function,
probabilistic | packet content(parts) | probability calc.
---------------+------------------------+-------------------
non-uniform |e.g. flow state, | selection function,
flow-state | packet content(parts) | probability calc.
---------------+------------------------+-------------------
mask/match | packet content(parts) | filter function
---------------+------------------------+-------------------
hash-based | packet content(parts) | hash function
---------------+------------------------+-------------------
router state | router state | router state
| | discovery
---------------+------------------------+-------------------
5.1 Information Model for Sampling Techniques
In this section we define the information models for most common
sampling techniques. Here the selection function is pre-defined
and given by the selector ID.
Sampler Description:
SELECTOR_ID
SELECTOR_TYPE
SELECTOR_PARAMETERS
ASSOCIATIONS
Where:
SELECTOR_ID:
Unique ID for the packet sampler. The ID can be calculated under
consideration of the ASSOCIATIONS and a local ID.
SELECTOR_TYPE
For sampling processes the SELECTOR TYPE defines what sampling
algorithm is used.
Values: Systematic Count-based | Systematic Time-based | Random
n-out-of-N | Uniform Probabilistic | Non-uniform Probabilistic |
Non-uniform Flow-state
SELECTOR_PARAMETERS
For sampling processes the SELECTOR PARAMETERS define the input
parameters for the process. Interval length in systematic
sampling means, that all packets that arrive in this interval
are selected. The spacing parameter defines the spacing in time
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 25]
Internet Draft Techniques for IP Packet Selection February 2004
or number of packets between the end of one sampling interval
and the start of the next succeeding interval.
Case n out of N:
- Population size N, Sample size n
Case Systematic Time Based:
- Interval length (in usec), Spacing (in usec)
Case Systematic Count Based:
- Interval length(in packets), Spacing (in packets)
Case Uniform Probabilistic (with equal probability per packet):
- Sampling probability p
Case Non-uniform Probabilistic:
- Calculation function for sampling probability p (see also
section 3.1.2.2.4)
Case flow state:
- Information reported for flow state can be found in
[MolFl03](see also section 3.1.2.2.4)
ASSOCIATIONS
The ASSOCIATIONS field describes the observation point and
(possibly) the IPFIX processes to which the packet selector is
associated. The STREAM ID denotes the origin of the data stream
that is input to the selection function. It can be the
observation point directly or the ID of another selector. With
this it is possible to define combined schemes. If the STREAM ID
contains IDs from other selectors, one can derive the original
observation point from the selector definitions of these
specified selectors.
Values: <STREAM ID, IPFIX Metering process ID, IPFIX Exporting
process ID, IDs of other associated processes>
With STREAM ID: Observation point ID AND List of SELECTOR_IDs
5.2 Information Model for Filtering Techniques
In this section we define the information models for most common
filtering techniques. The information model structure closely
parallels the one presented for the sampling techniques.
Filter Description:
SELECTOR_ID
SELECTOR_TYPE
SELECTOR_PARAMETERS
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 26]
Internet Draft Techniques for IP Packet Selection February 2004
ASSOCIATIONS
Where:
SELECTOR_ID:
Unique ID for the packet filter. The ID can be calculated under
consideration of the ASSOCIATIONS and a local ID.
SELECTOR_TYPE
For filtering processes the SELECTOR TYPE defines what filtering
type is used.
Values: Matching | Hashing | Router_state
SELECTOR_PARAMETERS
For filtering processes the SELECTOR PARAMETERS define formally
the common property of the packet being filtered. For the
filters of type Matching and Hashing the definitions have a lot
of points in common.
Values:
Case Matching
- <Header type = ip v4 >
- <bit specification, header part>
- <Selection interval specification, header part>
- <Header type = ipv6>
- <bit specification, header part>
- <Selection interval specification, header part>
- <payload byte number N>
- <bit specification, payload part>
- <Selection interval specification, payload part>
Notes to Case Matching:
- The filter can be defined for the header part only, for the
payload part only or for both. In the latter case the
matching must be an AND of the two.
- The bit specification, for the header part, can be
specified for ipv4 or ipv6 only, or both
- In case of ipv4, the bit specification is a sequence of 20
Hexadecimal numbers [00,FF] specifying a 20 bytes bitmask
to be applied to the header
- In case of ipv6, it is a sequence of 40 Hexadecimal numbers
[00,FF] specifying a 40 bytes bitmask to be applied to the
header
- The bit specification, for the payload part, is a sequence
of Hexadecimal numbers [00,FF] specifying the bitmask to be
applied to the first N bytes of the payload, as specified
by the previous field. In case the Hexadecimal number
sequence is longer then N, only the first N numbers are
considered.
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 27]
Internet Draft Techniques for IP Packet Selection February 2004
- In case the payload is shorter than N, the packet will not
match the filter Other options, like padding with zeros,
may be considered in the future.
- The selection interval specification is a list of non
overlapping intervals [intv_begin, intv_end] where
intv_begin, intv_end are bit strings of length 20*8 (ipv4
case), 40*8 (ipv6 case), N*8 (payload case).
- A filter cannot be defined on the options field of the ipv4
header, neither on stacked headers of ipv6.
- This specification doesnÆt preclude the future definition
of a high level syntax for defining in a concise way bit
selection and matching rules in a more human readable form
(e.g. ôTCP port in [2000,3000]ö). The requirement is that
such a syntax can be univoquely compiled into the one
defined above
Case Hashing:
- <Header type = ipv4>
- <Input bit specification, header part>
- <Header type = ipv6>
- <Input bit specification, header part>
- <payload byte number N>
- <Input bit specification, payload part>
- <additional hash input bits (seed)>
- Hashing function specification
- Hash function name
- Length of input key (eliminate 0x bytes)
- Output value (length M and bitmask)
- Selection interval specification, as a list of non
overlapping intervals [start value, end value] where
value is in [0,2^M-1]
- Additional parameters dependent on specific hash
function
Notes to Case Hashing:
- On Input bit specifications fields, the same notes on bit
specifications of the Matching case reported above apply
- Some hash functions, their detailed functioning and their
specific parameter list are described in [NiMD03].
Case Router State:
- Ingress interface at which the packet arrives equals a
specified value
- Egress interface to which the packet is routed equals a
specified value
- Packet violated Access Control List (ACL) on the router
- Reverse Path Forwarding (RPF) failed for the packet
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 28]
Internet Draft Techniques for IP Packet Selection February 2004
- Resource Reservation is insufficient for the packet
- No route found for the packet
- Origin AS equals a specified value or lies within a given
range
- Destination AS equals a specified value or lies within a
given range
Note to Case Router State:
- All Router state entries can be linked by AND, OR, NOT
operators
ASSOCIATIONS
The ASSOCIATIONS field describes the observation point and
(possibly) the IPFIX processes to which the packet selector is
associated. The STREAM ID denotes the origin of the data stream
that is input to the selection function. It can be the
observation point directly or the ID of another selector. With
this it is possible to define combined schemes. If the STREAM ID
contains IDs from other selectors, one can derive the original
observation point from the selector definitions of these
specified selectors.
Values: <STREAM ID, IPFIX Metering process ID, IPFIX Exporting
process ID, IDs of other associated processes>
With STREAM ID: Observation point ID AND List of SELECTOR_IDs
6. Composite Techniques
Composite schemes are realized by using the STREAM ID in the
information models. The STREAM ID denotes from which selectors
the input stream originates. If multiple stream IDs are given,
this means that the selector operates on the packet stream
simply resulting from the time superposition of the output of
all the listed filters and samplers. Some examples of composite
schemes are reported below.
6.1 Cascaded filtering->sampling or sampling->filtering
If a filter precedes a sampling process the role of filtering is
to create a set of ôparent populationsö from a single stream
that can then be fed independently to different sampling
functions, with different parameters tuned for the population
itself (e.g. if streams of different intensity result from
filtering, it may be good to have different sampling rates). If
filtering follows a sampling process, the same sampling rate and
type is applied to the whole stream, independently of the
relative size of the streams resulting from the filtering
function. Moreover, also packets not destined to be selected in
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 29]
Internet Draft Techniques for IP Packet Selection February 2004
the filtering operation will ôloadö the sampling function. So,
in principle, filtering before sampling allows a more accurate
tuning of the sampling procedure, but if filters are too complex
to work at full line rate (e.g. because they have to access
router state information), sampling before filtering may be a
need.
6.2 Stratified Sampling
Stratified sampling is one example for using a composite
technique. The basic idea behind stratified sampling is to
increase the estimation accuracy by using a-priori information
about correlations of the investigated characteristic with some
other characteristic that is easier to obtain. The a-priori
information is used to perform an intelligent grouping of the
elements of the parent population. With this a higher estimation
accuracy can be achieved with the same sample size or the sample
size can be reduced without reducing the estimation accuracy.
Stratified sampling divides the sampling process into multiple
steps. First, the elements of the parent population are grouped
into subsets in accordance to a given characteristic. This
grouping can be done in multiple steps. Then samples are taken
from each subset.
The stronger the correlation between the characteristic used to
divide the parent population (stratification variable) and the
characteristic of interest (for which an estimate is sought
after), the easier is the consecutive sampling process and the
higher is the stratification gain. For instance if the dividing
characteristic were equal to the investigated characteristic,
each element of the sub-group would be a perfect representative
of that characteristic. In this case it would be sufficient to
take one arbitrary element out of each subgroup to get the
actual distribution of the characteristic in the parent
population. Therefore stratified sampling can reduce the costs
for the sampling process (i.e. the number of samples needed to
achieve a given level of confidence).
For stratified sampling one has to specify classification rules
for grouping the elements into subgroups and the sampling scheme
that is used within the subgroups. The classification rules can
be expressed by multiple filters. For the sampling scheme within
the subgroups the parameters have to be specified as described
above. The use of stratified sampling methods for measurement
purposes is described for instance in [ClPB93] and [Zseb03].
7. Security Considerations
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 30]
Internet Draft Techniques for IP Packet Selection February 2004
Malicious users or attackers may be interested to hide packets
from service providers or network operators. For instance if
packet selectors are used for accounting or intrusion detection
applications, users may want to prevent that packets are
selected. If a deterministic sampling scheme is used or a
selection scheme that takes packet content into account, the
user can shape or send packets in a way that they are less
likely to be selected (see also section 4.2.2). Even if the
selection function is unknown to the user, some insight into the
function can be obtained by performing experiments with
different packet sequences. This has to be taken into account
when choosing an appropriate packet selection technique.
Further security threats can occur if the configuration of
sampling parameters or the communication of sampling parameters
to the application is corrupted. This document only describes
sampling schemes that can be used for packet selection. It
neither describes a mechanism how those parameters are
configured nor how these parameters are communicated to the
application. Therefore the security threats that originate from
this kind of communication cannot be assessed with the
information given in this document.
8. References
[AmCa89] Paul D. Amer, Lillian N. Cassel: Management of
Sampled Real-Time Network Measurements, 14th
Conference on Local Computer Networks, October
1989, Minneapolis, pages 62-68, IEEE, 1989
[ClPB93] K.C. Claffy, George C Polyzos, Hans-Werner Braun:
Application of Sampling Methodologies to Network
Traffic Characterization, Proceedings of ACM
SIGCOMM'93, San Francisco, CA, USA, September 13 -
17, 1993
[CoGi98] I. Cozzani, S. Giordano: Traffic Sampling Methods
for end-to-end QoS Evaluation in Large
Heterogeneous Networks. Computer Networks and ISDN
Systems, 30 (16-18), September 1998.
[Du04] N.G. Duffield (Ed.), A Framework for Packet
Selection and Reporting, Internet Draft draft-ietf-
psamp-framework-05, work in progress, January 2004
[DuGr00] N.G. Duffield, M. Grossglauser: Trajectory Sampling
for Direct Traffic Observation, Proceedings of ACM
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 31]
Internet Draft Techniques for IP Packet Selection February 2004
SIGCOMM 2000, Stockholm, Sweden, August 28 -
September 1, 2000.
[DuGeGr02] N.G. Duffield, A. Gerber, M. Grossglauser,
Trajectory Engine: A Backend for Trajectory
Sampling, IEEE Network Operations and Management
Symposium 2002, Florence, Italy, April 15-19, 2002.
[DuLT01] Nick Duffield, Carsten Lund, and Mikkel Thorup:
Charging from Sampled Network Usage, ACM Internet
Measurement Workshop IMW 2001, San Francisco, USA,
November 1-2, 2001
[EsVa01] C. Estan and G. Varghese, ôNew Directions in
Traffic Measurement and Accountingö, ACM SIGCOMM
Internet Measurement Workshop 2001, San Francisco
(CA) Nov. 2001
[HT52] D.G. Horvitz and D.J. Thompson, A Generalization of
Sampling without replacement from a Finite
Universe. J. Amer. Statist. Assoc. Vol. 47, pp.
663-685, 1952.
[ISO3309] International Organization for Standardization,
"ISO Information Processing Systems - Data
Communication High-Level Data Link Control
Procedure - Frame Structure", IS 3309, October
rd
1984, 3 Edition.
[Jenk97] B. Jenkins: Algorithm Alley, Dr. Dobb's Journal,
September 1997.
http://burtleburtle.net/bob/hash/doobs.html
[JePP92] Jonathan Jedwab, Peter Phaal, Bob Pinna: Traffic
Estimation for the Largest Sources on a Network,
Using Packet Sampling with Limited Storage, HP
technical report, Managemenr, Mathematics and
Security Department, HP Laboratories, Bristol,
March 1992,
http://www.hpl.hp.com/techreports/92/HPL-92-35.html
[Knuth98] Donald E. Knuth: The Art of Computer Programming,
Volume 3: Searching and Sorting, Addison Wesley,
1998
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 32]
Internet Draft Techniques for IP Packet Selection February 2004
[MolFl03] M.Molina: Flow selection support in IPFIX, Internet
Draft <draft-molina-flow-selection-00.txt>, work in
progress, October 2003.
[Moli03] M.Molina: A scalable and efficient methodology for
flow monitoring in the internet, International
Teletraffic Congress (ITC-18), Berlin, Sep. 2003
[NiMD03] S. Niccolini, M.Molina, N.Duffield: Hash functions
description for packet selection, Internet Draft
<draft-niccolini-hash-descr-00.txt>, work in
progress, October 2003.
[QuZC03] J. Quittek, T. Zseby, B. Claise, S. Zander:
Requirements for IP Flow Information Export,
Internet Draft <draft-ietf-ipfix-reqs-13.txt>, work
in progress, December 2003
[Zseb03] T. Zseby: Stratification Strategies for Sampling-
based Non-intrusive Measurement of One-way Delay.
Passive and Active Measurement Workshop
Proceedings, La Jolla, CA, USA, pp. 171-179, Apr.
2003
9. Author's Addresses
Tanja Zseby
Fraunhofer Institute for Open Communication Systems
Kaiserin-Augusta-Allee 31
10589 Berlin
Germany
Phone: +49-30-34 63 7153
Fax: +49-30-34 53 8153
Email: zseby@fokus.fhg.de
Maurizio Molina
NEC Europe Ltd., Network Laboratories
Adenauerplatz 6
69115 Heidelberg
Germany
Phone: +49 6221 90511-18
Email: molina@ccrle.nec.de
Fredric Raspall
NEC Europe Ltd., Network Laboratories
Adenauerplatz 6
69115 Heidelberg
Germany
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 33]
Internet Draft Techniques for IP Packet Selection February 2004
Phone: +49 6221 90511-31
EMail: raspall@ccrle.nec.de
Nick Duffield
AT&T Labs - Research
Room B-139
180 Park Ave
Florham Park NJ 07932, USA
Phone: +1 973-360-8726
Email: duffield@research.att.com
10.
Intellectual Property Statement
AT&T Corporation may own intellectual property applicable to
this
contribution. The IETF has been notified of AT&T's licensing
intent for the specification contained in this document. See
http://www.ietf.org/ietf/IPR/ATT-GENERAL.txt for AT&T's IPR
statement.
11.
Full Copyright Statement
Copyright (C) The Internet Society (2002). All Rights Reserved.
This document and translations of it may be copied and furnished
to others, and derivative works that comment on or otherwise
explain it or assist in its implementation may be prepared,
copied, published and distributed, in whole or in part, without
restriction of any kind, provided that the above copyright
notice and this paragraph are included on all such copies and
derivative works. However, this document itself may not be
modified in any way, such as by removing the copyright notice or
references to the Internet Society or other Internet
organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights
defined in the Internet Standards process must be followed, or
as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not
be revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided
on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 34]
Internet Draft Techniques for IP Packet Selection February 2004
Zseby, Molina, Raspall, Duffield Expires August 2004 [Page 35]