Internet Draft
Document: <draft-ietf-psamp-sample-tech-03.txt> T. Zseby
Expires: April 2003 Fraunhofer FOKUS
M. Molina
NEC Europe Ltd.
F. Raspall
NEC Europe Ltd.
Nick Duffield
AT&T Labs - Research
October2003
Sampling and Filtering Techniques for IP Packet Selection
Status of this Memo
This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working
documents as Internet-Drafts. Internet-Drafts are draft
documents valid for a maximum of six months and may be updated,
replaced, or obsoleted by other documents at any time. It is
inappropriate to use Internet-Drafts as reference material or to
cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This document describes sampling and filtering techniques for IP
packet selection. In two information models (one for sampling,
one for filtering) it defines what parameters are needed to
describe the most common selection schemes and shows how
techniques can be combined to build more elaborate packet
selectors. The information models are used for configuring the
selection technique in measurement processes and for reporting
the technique in use to a collector.
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 1]
Internet Draft Techniques for IP Packet Selection October 2003
Table of Contents
1. Introduction.................................................2
2. Terminology..................................................3
3. Scope and Deployment of Packet Selection Techniques..........8
3.1 Sampling.....................................................9
3.1.1 Systematic Sampling........................................10
3.1.2 Random Sampling............................................10
3.1.2.1 n-out-of-N sampling......................................11
3.1.2.2 Probabilistic sampling...................................11
3.1.2.2.1 Uniform Probabilistic Sampling.........................11
3.1.2.2.2 Non-Uniform Probabilistic Sampling.....................11
3.1.2.2.3 Non-Uniform flow State dependent sampling..............12
3.1.3 Overview of sampling schemes regarding content dependence..12
4. Filtering...................................................13
4.1 Mask/Match filtering........................................14
4.2 Hashing filtering...........................................14
4.2.1 Examples of Hashing Filtering applications.................15
4.2.1.1 Random sampling emulation................................15
4.2.1.2 Consistent packet selection and its applications.........15
4.2.2 Guarding Against Pitfalls and Vulnerabilities..............16
4.3 Router State filtering......................................17
5. Input Parameters and Information Models.....................17
5.1 Information Model for Sampling Techniques...................18
5.2 Information Model for Filtering Techniques..................20
6. Composite Techniques........................................23
6.1 Cascaded filtering->sampling or sampling->filtering.........23
6.2 Stratified Sampling.........................................23
7. Security Considerations.....................................24
8. References..................................................25
9. Author's Addresses..........................................26
10. Intellectual Property Statement.............................27
11. Full Copyright Statement....................................27
1. Introduction
Increasing data rates and growing measurement demands increase
the requirements for data collection resources. For measurement
scenarios in backbone networks it is often required to measure
whole traffic aggregates instead of single flows. Furthermore,
some measurement methods require the capturing of packet headers
or even parts of the payload. All this can lead to an
overwhelming amount of measurement data, resulting in high
demands regarding resources for metering, storage, transport and
post processing.
In some cases specialized hardware helps to fulfill these
demands but on the other hand increases the costs for providing
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 2]
Internet Draft Techniques for IP Packet Selection October 2003
the measurement. Since measurements are mainly a supporting
functionality for the service provisioning, measurement costs
usually should be limited to a small fraction of the costs of
the network service provisioning itself. A reduction of the data
that is considered and reported by a measurement process is
crucial to prevent the depletion of the available (i.e. the
affordable) resources. Such a reduction can be achieved by a
reasonable deployment of packet selection techniques, that
sample a subset of the packets while still allowing an
appropriate accuracy, or filter out all packets that are not of
interest for the measurement at all. Packet selection helps to
prevent an exhaustion of resources and to limit the measurement
costs. Examples for applications that benefit from packet
selection are given in [DuGG03].
2. Terminology
The terms used throughout this document are consistent with
those defined in [DuGG03].
2.1 Terminology for Observation and Selection
* Observation Point: a location in the network where a packet
stream is observed. Examples include:
- a line to which a probe is attached;
- a shared medium, such as an Ethernet-based LAN;
- a single port of a router, or set of interfaces (physical
or logical) of a router;
- an embedded measurement subsystem within an interface.
* Measurement Process: the combination of a selection process
followed by a reporting process.
* Observed Packet Stream: the set of all packets observed at the
observation point.
* Packet Stream: either the observed packet stream, or a subset
of it that forms the output of a selection process.
Note that packets selected from a stream, e.g. by sampling, do
not necessarily possess a property by which they can be
distinguished from packets that have not been selected. For
this reason the term stream is favored over flow, which is
defined as set of packets with common properties [QuZC02].
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 3]
Internet Draft Techniques for IP Packet Selection October 2003
* Selection Process: takes a packet stream as its input and
selects a subset of that stream as its output.
* Selector: defines the action of a selection process on a
single packet of its input. A selected packet becomes an
element of the output packet stream of the selection process.
The selector can make use of the following information in
determining whether a packet is selected:
- the packets content;
- information derived from the packet's treatment at the
observation point;
- any selection state that may be maintained by the
selection process.
* Packet Content: the union of the packet header (which includes
link layer, network layer and other encapsulation headers) and
the packet payload.
* Selection State: a selection process may maintain state
information for use by the selection process and/or the
reporting process. At a given time, the selection state may
depend on packets observed at and before that time, and other
variables. Examples include:
- sequence numbers of packets at the input of selectors;
- a timestamp of arrival of the packet at the observation
points;
- iterators for pseudorandom number generators;
- hash values calculated during selection;
- indicators of whether the packet was selected by a given
selector.
Selectors may change the selection state.
* Composite Selection Processes: selection processes may be
composed, in which case the output stream issuing from one of
the selection processes in the composition forms the input
stream for the succeeding selection process.
The composition of selection processes is also a selection
process; the elements of the composition can be viewed as
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 4]
Internet Draft Techniques for IP Packet Selection October 2003
selection subprocesses of the composite. A selection process
that is not a subprocess of another selection process takes
the observed packet stream as its input.
* Primitive Selection Process: a selection process that is not a
composition of other selection processes.
* Composite Selector: the selector of a composite selection
process.
* Primitive Selector: the selector of a primitive selection
process.
2.2 Terminology for Reporting, Export and Collection
* Reporting Process: creates a report stream on packets selected
by a selection process, in preparation for export. The input
to the reporting process comprises that information available
to the selection process per selected packet, specifically:
- the selected packets content;
- information derived from the selected packet's treatment
at the observation point;
- any selection state maintained by the inputting selection
process, reflecting any modifications to the selection
state made during selection of the packet.
* Report Stream: the output of a reporting process is a report
stream, comprising two distinguished types of information:
packet reports, and report interpretation.
* Packet Reports: a configurable subset of the per packet input
to the reporting process.
* Report Interpretation: subsidiary information relating to one
or more packets, that is used for interpretation of their
packet reports. Examples include configuration parameters of
the measurement process and of the export process.
* Export Process: sends the output of one or more reporting
processes to one or more collectors.
* Collector: a collector receives a report stream exported by
one or more export processes. In some cases, the host of the
measurement and/or export processes may also serve as the
collector.
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 5]
Internet Draft Techniques for IP Packet Selection October 2003
* Export packets: one or packet reports, and perhaps report
interpretation, are bundled by the export process into a
export packet for export to a collector.
Various possibilities for the high level architecture of these
elements is as follows.
MP = Measurement Process, EP = Export Process
+---------------------+ +------------------+
|Observation Point(s) | | Collector(1) |
|MP(s)--->EP----------+---------------->| |
|MP(s)--->EP----------+-------+-------->| |
+---------------------+ | +------------------+
|
+---------------------+ | +------------------+
|Observation Point(s) | +-------->| Collector(2) |
|MP(s)--->EP----------+---------------->| |
+---------------------+ +------------------+
+---------------------+
|Observation Point(s) |
|MP(s)--->EP---+ |
| | |
|Collector(3)<-+ |
+---------------------+
2.3 Packet Selection Terminology.
* Filtering: a filter is a selection operation that selects a
packet deterministically based on the packet content, its
treatment, and functions of these occurring in the selection
state. Examples include mask/match filtering, and hash-based
selection.
* Sampling: a selection operation that is not a filter is called
a sampling operation. This reflects the intuitive notion that
if the selection of a packet cannot be exactly determined from
its content alone, there must be some type of sampling taking
place.
* Content-independent Sampling: a sampling operation that does
not use packet content (or quantities derived from it) as the
basis for selection is called a content-independent sampling
operation. Examples include systematic sampling, and uniform
pseudorandom sampling driven by a pseudorandom number whose
generation is independent of packet content. Note that in
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 6]
Internet Draft Techniques for IP Packet Selection October 2003
independent sampling it is not necessary to access the packet
content in order to make the selection decision.
* Content-dependent Sampling: a sampling operation where
selection is dependent on packet content is called a content-
dependent sampling operation. Examples include pseudorandom
selection according to a probability that depends on the
contents of a packet field; note that this is not a filter.
* Approximate Selection: selection operations in any of the
above four categories may be approximated by operations in the
same or another category for the purposes of implementation.
For example, uniform pseudorandom sampling may be approximated
by hash-based selection, using a suitable hash function and
hash domain. In this case, the closeness of the approximation
depends on the choice of hash function and hash domain.
* Hash-based Selection: a filter specified by a hash domain, a
hash function, and hash range and a hash selection range.
* Hash Domain: a subset of the packet content and the packet
treatment, viewed as an N-bit string for some positive integer
N.
* Hash Range: a set of M-bit strings for some positive integer
M.
* Hash Function: a deterministic map from the hash domain into
the hash range.
* Hash Selection Range: a subset of the hash range. The packet
is selected if the action of the hash function on the hash
domain for the packet yields a result in the hash selection
range.
* Population size: the number of packets in a subset of a packet
stream.
* Sample size: the number of packets selected from a subset of a
packet stream by a selection operation.
* Attained Selection Frequency: the actual proportion of packets
selected by a selection operation from a given subset of a
packet stream. Calculated as the ratio of the sample size to
the population size for the subset. When the selection
operation is a sampling operation, or approximates a sampling
operation, we can speak of the Attained Sampling Frequency.
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 7]
Internet Draft Techniques for IP Packet Selection October 2003
* Target Sampling Frequency: a configurable sampling frequency
in a sampling operation. Due to the inherent statistical
variability of sampling, the target and attained sampling
rates will not in general be equal, although they may be close
in some circumstances, e.g., when the population size is
large.
3. Scope and Deployment of Packet Selection Techniques
Packet selection techniques generate a subset of packets from an
Observed Packet Stream at an observation point. We distinguish
between sampling and filtering.
Sampling is targeted at the selection of a representative subset
of packets. The subset is used to infer knowledge about the
whole set of observed packets without processing them all. . The
selection can depend on packet position, and/or on packet
content, and/or on (pseudo) random decisions.
Filtering selects a subset with common properties. This is used
if only a subset of packets is of interest. The properties can
be directly derived from the packet content, or depend on the
treatment given by the router to the packet. Filtering is a
deterministic operation. It depends on packet content or router
treatment. It never depends on packet position or on (pseudo)
random decisions.
Note that a common technique to select packets is to compute a
Hash Function on some bits of the packet header and/or content
and to select it if the Hash Value falls in the Hash Selection
Range. Since hashing is a deterministic operation on the packet
content, it is a filtering technique according to our
categorization.
Nevertheless, hash functions are sometimes used to emulate
random sampling. Depending on the chosen input bits, the Hash
Function and the Hash Selection Range, this technique can be
used to emulate the random selection of packets with a given
probability p.
It is also a powerful technique to consistently select the same
packet subset at multiple observation points [DuGr00]
The introduced classification is mainly useful for the
definition of an information model describing Primitive
Selectors . More complex selection techniques may then be
described through the composition of cascaded sampling and
filtering operations.
For example, a packet selection that weights the selection
probability on the basis of the packet length can be described
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 8]
Internet Draft Techniques for IP Packet Selection October 2003
as a set of filter/sampling cascades. However, this descriptive
approach is not intended to be rigid: if a common and
consolidated selection practice turns out to be too complex to
be described as a composition of the mentioned building blocks,
an ad hoc description can be specified instead.
Packet selectors can be part of an IPFIX metering process which
also can use the IPFIX exporting process. This is expressed as
association to one or more IPFIX processes.
3.1 Sampling
The deployment of sampling techniques aims at the provisioning
of information about a specific characteristic of the parent
population at a lower cost than a full census would demand. In
order to plan a suitable sampling strategy it is therefore
crucial to determine the needed type of information and the
desired degree of accuracy in advance.
First of all it is important to know the type of metric that
should be estimated. The metric of interest can range from
simple packet counts [JePP92] up to the estimation of whole
distributions of flow characteristics (e.g. packet
sizes)[ClPB93].
Secondly, the required accuracy of the information and with
this, the confidence that is aimed at, should be known in
advance. For instance for usage-based accounting the required
confidence for the estimation of packet counters can depend on
the monetary value that corresponds to the transfer of one
packet. That means that a higher confidence could be required
for expensive packet flows (e.g. premium IP service) than for
cheaper flows (e.g. best effort). The accuracy requirements for
validating a previously agreed quality can also vary extremely
with the customer demands. These requirements are usually
determined by the service level agreement (SLA).
Sampling is considered as part of the metering process. A
metering process consists of multiple functions (capturing, time
stamping, etc.). Sampling can be applied at different functions
of the metering process. In the following we consider a measured
IP packet with its observation point and timestamp as basis
elements of the parent population.
The sampling method and the parameters in use must be clearly
communicated to all applications that use the measurement data.
Only with this knowledge a correct interpretation of the
measurement results can be ensured.
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 9]
Internet Draft Techniques for IP Packet Selection October 2003
Sampling methods can be characterized by the sampling algorithm,
the trigger type used for starting a sampling interval and the
length of the sampling interval. These parameters are described
here in detail. The sampling algorithm describes the basic
process for selection of samples. In accordance to [AmCa89] and
[ClPB93] we define the following basic sampling processes:
3.1.1 Systematic Sampling
Systematic sampling describes the process of selecting the start
points and the duration of the selection intervals according to
a deterministic function. This can be for instance the periodic
selection of every k-th element of a trace but also the
selection of all packets that arrive at pre-defined points in
time. Even if the selection process does not follow a periodic
function (e.g. if the time between the sampling intervals varies
over time) we consider this as systematic sampling as long as
the selection is deterministic.
The use of systematic sampling always involves the risk of
biasing the results. If the systematics in the sampling process
resemble systematics in the observed stochastic process
(occurrence of the characteristic of interest in the network),
there is a high probability that the estimation will be biased.
Systematics in the observed process might not be known in
advance.
Here only equally spaced schemes are considered, where triggers
for sampling are periodic, either in time or in packet count.
All packets occurring in a selection interval (either in time or
packet count) beyond the trigger are selected.
Systematic count-based
In systematic count-based sampling the start and stop triggers
for the sampling interval are defined in accordance to the
special packet position (packet count).
Systematic time-based
In systematic count-based sampling the start and stop triggers
for the sampling interval are defined in accordance to the
temporal packet position (arrival time).
Both schemes are content-independent selection schemes.
3.1.2 Random Sampling
Random sampling selects the starting points of the sampling
intervals in accordance to a random process. The selection of
elements are independent experiments. With this, unbiased
estimations can be achieved. In contrast to systematic sampling,
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 10]
Internet Draft Techniques for IP Packet Selection October 2003
random sampling requires the generation of random numbers. One
can differentiate two methods of random sampling:
3.1.2.1 n-out-of-N sampling
In n-out-of-N sampling n elements are selected out of the parent
population that consists of N elements. One example would be to
generate n different random numbers in the range [1,N] and
select all packets which have a packet position equal to one of
the random numbers. For this kind of sampling the sample size n
is fixed.
3.1.2.2 Probabilistic sampling
In probabilistic sampling the decision whether an element is
selected or not is made in accordance to a pre-defined selection
probability. An example would be to flip a coin for each packet
and select all packets for which the coin showed the head. For
this kind of sampling the sample size can vary for different
trials. The selection probability does not necessarily has to be
the same for each packet. Therefore we distinguish between
uniform probabilistic sampling (with the same selection
probability for all packets) and non-uniform probabilistic
sampling (where the selection probability can vary for different
packets).
3.1.2.2.1 Uniform Probabilistic Sampling
For Uniform Probabilistic Sampling packets are selected
independently with some uniform probability p. This sampling can
be count-driven, and is sometimes referred to as geometric
random sampling, since the difference in count between
successive selected packets are independent random variables
with a geometric distribution of mean 1/p. A time-driven analog,
exponential random sampling, has the time between triggers
exponentially distributed.
Both geometric and exponential random sampling are examples of
what is known as additive random sampling, defined as sampling
where the intervals or counts between successive samples are
independent identically distributed random variable.
3.1.2.2.2 Non-Uniform Probabilistic Sampling
This is a variant of Uniform Probabilistic Sampling in which the
sampling probabilities can depend on the selection process
input. This can be used to weight sampling probabilities in
order e.g. to boost the chance of sampling packets that are rare
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 11]
Internet Draft Techniques for IP Packet Selection October 2003
but are deemed important. Unbiased estimators for quantitative
statistics are recovered by renormalization of sample values;
see [HT52].
3.1.2.2.3 Non-Uniform flow State dependent sampling
Another type of sampling that can be classified as probabilistic
Non-Uniform is closely related to the flow concept as defined
in [QuZC02], and it is only used jointly with a flow monitoring
function (IPFIX monitoring function). Packets are selected,
dependent on a selection state. The point, here, is that the
selection state is determined also by the state of the flow the
packet belongs to and/or by the state of the other flows
currently being monitored by the associated flow monitoring
function. An example for such an algorithm is the sample and
hold method described in [EsVa01]:
- if a packet accounts for a flow record that already exists in
the IPFIX flow recording process, select the packet (i.e.
simply update the flow record)
- if a packet doesn't account to any existing flow record,
select it with probability p. If selected, create a new flow
record for it.
A further algorithm that fits into the category of non-uniform
flow state dependent sampling is described in [Molina03].
This type of sampling is content dependent because the
identification of the flow the packet belongs to requires
analyzing part of the packet content. If the packet is selected,
then it is passed as an input to the IPFIX monitoring function
(this is called Local Export in [DuGG03]
Selecting the packet depending on the state of a flow cache is
useful when memory resources of the flow monitoring function are
scarce (i.e. theres no room to keep all the flows that have
been scheduled for monitoring). See [MolFl03] for a more
detailed description of the motivations for this type of
sampling and the impact on the IPFIX metering.
3.1.3 Overview of sampling schemes regarding content dependence
n-out-of-N sampling and uniform probabilistic sampling are
content-
-independent selection schemes. For non-uniform
probabilistic sampling and non-uniform flow state dependent
sampling the sampling probability can be based on packet
content. The following table shows what schemes are content-
independent and which can depend on the content.
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 12]
Internet Draft Techniques for IP Packet Selection October 2003
Sampling Scheme | content- | content-
| independent | dependent
------------------------------+---------------+---------------
systematic | X |
count-based | |
------------------------------+---------------+---------------
systematic | X |
time-based | |
------------------------------+---------------+---------------
random | X |
n-out-of-N | |
------------------------------+---------------+---------------
random | X |
uniform probabilistic | |
------------------------------+---------------+---------------
random | | X
non-uniform probabilistic | |
------------------------------+---------------+---------------
random | | X
non-uniform flow-state | |
------------------------------+---------------+---------------
4. Filtering
Filtering is the selection of packets based on the packet
content, the treatment of the packet at the observation point,
or deterministic functions of these occurring in the selection
state. The packet is selected if these quantities fall into a
specified range.
Packet filtering is done for a wide variety of purposes e.g. for
security, SLA enforcing, accounting. Depending on the type of
filtering, it can be applied in different parts of the metering
process. The role of filtering, as the word itself suggest, is
to separate all the packets having a certain property from those
not having it. A distinguishing characteristic from sampling is
that the property never depends on the packet position in time
or in the space, or on a random process.
We identify and describe in the following three filtering
techniques. The first two (Mask/Match filtering and Hashing
filtering) are stateless, and therefore can make their decision
based on the analysis of portion of the packet only. The other
(router state filtering) requires to access state information
after the analysis of part of the packet and is therefore more
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 13]
Internet Draft Techniques for IP Packet Selection October 2003
complex: its usage makes sense only in particular circumstances,
as described below.
4.1 Mask/Match filtering
This type of filtering selects a packet operating as follows:
first a combination of packets bit positions is selected taking
the logical AND of portion of the packets bits and a mask, then
its checked if the result, seen as an hexadecimal number, falls
within a selection range. In the simplest case the selection
range can be a single value (e.g. filter selects only the
packets with a certain specific header field, e.g. a specific
source IP address) but more in general the selection range can
be a sequence of non-overlapping intervals if high level
interfaces are used to specify mask and matches. The selected
bits of the packet arent necessarily only those of the IP
header, but in case both bits of the IP header and the payload
are selected, the masks and the selection intervals MUST be
specified separately for the header and the payload.
An implementation is not constrained to apply exactly all the
steps or in this sequence, provided that the result is
equivalent to a logical function doing it.
Examples of filters of this class are filters that select
packets based on the matching of some of the IP header fields
with a (possibly masked) specific value, filters that select
packets having some IP header fields values falling within a
range, filters that do the same as above on some of the
transport header fields (that are thus considered as part of the
payload), or a combination of all of the above mentioned
possibilities.
4.2 Hashing filtering
A Hash Function h maps the packet content c, or some portion of
it, onto a Hash Range R. The packet is selected if h(c) is an
element of S, which is a subset of R called the Hash Selection
Range. Thus hash-based sampling is indeed a particular case of
filtering: the object is selected if c is in inv(h(S)). But for
desirable Hash Functions the inverse image inv(h(S)) will be
extremely complex, and hence h would not be expressible as, say,
a match/mask filter or a simple combination of these.
Hash-based sampling has mainly two types of usage: it offers a
way to emulate random sampling by using packet content to
generate pseudorandom variates and a way to consistently select
subsets of packets that share a common property. In the
following subsections we give more details about them.
However, both usages require that the Hash Functions has two
statistical properties. First, the hash function h must have
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 14]
Internet Draft Techniques for IP Packet Selection October 2003
good mixing properties, in the sense that small changes in the
input (e.g. the flipping of a single bit) cause large changes in
the output (many bits change). Then any local clump of values of
c is spread widely over R by h, and so the distribution of h(c)
is fairly uniform even if the distribution of c is not. Then the
sampling rate is #S/#R, which can be tuned by choice of S. If S
and R are sets contiguous integers, h(c), suitably shifted and
normalized, can be interpreted as a pseudorandom variate. The
second desirable property depends more closely on the statistics
of the content c. In applications, the content c comprises a
number of distinct fields, c1 ... cm, e.g. source and
destination IP Address, IP identification, and TCP/UDP port
numbers (if present) for a packet. With a hash function
satisfying the first properties above, selection decisions will
appear uncorrelated with the contents of any individual field,
if the complementary fields are (i) sufficiently variable
themselves, and (ii) sufficiently uncorrelated with cj.
4.2.1 Examples of Hashing Filtering applications
4.2.1.1 Random sampling emulation
Although pseudorandom number generators with well understood
properties have been developed, they may not be the method of
choice in setting where computational resources are scarce. A
convenient alternative is to use hash functions of packet
content as a source of randomness. The hash (suitably
renormalized) is a pseudorandom variate in the interval [0,1].
Other schemes may use packet fields in iterators for
pseudorandom numbers.
The point here, is that the statistical properties of the
idealized packet selection law (such as independence of sampling
decisions for different packets, or independence on packet
content) may not be exactly shared by an implementation, but
only approximately so.
Although the selection decisions for non-uniform probabilistic
sampling (see Section 3.1.2.2.2 above) also depend on the packet
content, this form of sampling is distinguished from the use of
packet content to generate variates. In the former case, the
content only determines the selection probabilities: selection
could then proceed e.g by use of a variates obtained by an
independent pseudorandom number generator. In the latter case,
the content determines the pseudorandom variates rather than the
probabilities.
4.2.1.2 Consistent packet selection and its applications
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 15]
Internet Draft Techniques for IP Packet Selection October 2003
In Consistent packet selection, all routers in a network hash
parts of the packet content using identical hash function and
selection range. The domain of the hash is restricted to those
parts of the packet that are invariant from hop to hop. Fields
such as Time-to-Live, which is decremented per hop, and header
CRC, which is recalculated per hop, are thus excluded from the
hash domain. Thus, a given packet is selected at all either all
points on its path through the network, or at none. The domain
of the hash function needs to be wider than just a flow key, if
packets are to be selected quasi-randomly within flows (and e.g.
include portions of the payload), see [DuGr00]. If a report on
each selected packet is exported to a collector, the latter can
reconstruct trajectories of the selected packets, provided it
can match different reports on the same packet, and distinguish
these from reports on different packets. For this purpose,
reports may also contain a second distinct hash (identification
hash) of the selected packets.
Applications of consistent packet selection include (i)
estimation of the network path matrix, i.e., the traffic
intensities according to network path, broken down by flow; (ii)
detection of routing loops, as indicated by self-intersecting
trajectories; (iii) passive performance measurement: prematurely
terminating trajectories indicate packet loss, and packet One
Way Delay can be determined if reports include (synchronized)
timestamps; (iv) network attack tracing, of the actual paths
taken by attack packets with spoofed source addresses.
4.2.2 Guarding Against Pitfalls and Vulnerabilities
A concern is whether some large set of related packets could be
sampled at a rate that significantly differs from the expected
sampling rate, either (i) through unanticipated behavior in the
hash function, or (ii) because the packets had been deliberately
crafted to have this property.
The first point underlines the importance of using a hash
function with good mixing properties. Examples of such are CRC32
and hash functions based on modular arithmetic, see 6.4 in
[Knuth98]. The statistical properties of candidate hash
functions need to be evaluated, preferably on packet before
adoption for hash-based sampling
Hash sampling could be overloaded (or evaded) by an attacker if
the hash function and the selection rate are both known. A
service provider could build a first defense keeping the Hash
Selection Range S private. Then, an attacker could not determine
whether a crafted packet is selected, but he would still know
that a crafted a set of packets all with the same hash is either
all selected or all not selected. Moreover, when applications
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 16]
Internet Draft Techniques for IP Packet Selection October 2003
(like multi domain trajectory sampling, or One Way Delay
estimation across multiple domains) may require multiple
administrative entities to agree on a common hash function and
selection range, mutual trust between the entities cannot be
avoided and just keeping S secret may not be feasible. A
stronger defense is to employ a parametrizable hash function and
keep the parameter private: in this case, the set of hash values
of the packets could not be determined. Examples of parameters
are the initial vector in CRC32, and moduli in hashes based on
modular arithmetic.
4.3 Router State filtering
This class of filters select a packet on the basis of the
following conditions), possibly combined with the AND, OR or NOT
operators.
- Ingress interface at which packet arrives equals a
specified value
- Egress interface to which packet is routed to equals a
specified value
- Packet violated Access Control List (ACL) on the router
- Failed Reverse Path Forwarding (RPF)
- Failed Resource Reservation (RSVP)
- No route found for the packet
- Origin AS equals a specified value or lies within a given
range
- Destination AS equals a specified value or lies within a
given range
Router architectural considerations may preclude some
information concerning the packet treatment, e.g. routing state,
being available at line rate for selection of packets. However,
if selection not based on routing state has reduced down from
line rate, subselection based on routing state may be feasible.
5. Input Parameters and Information Models
This section gives an overview of different alternative
selection schemes and their required parameters. In order to be
compliant with PSAMP it is sufficient to implement one of the
proposed schemes.
The decision whether to select a packet or not is based on a
function which is performed when the packet arrives at the
selection process. Packet selection schemes differ in the input
parameters for the selection process and the functions they
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 17]
Internet Draft Techniques for IP Packet Selection October 2003
require to do the packet selection. The following table gives an
overview.
Scheme | input parameters | functions
---------------+------------------------+-------------------
systematic | packet position | packet counter
count-based | sampling pattern |
---------------+------------------------+-------------------
systematic | arrival time | clock or timer
time-based | sampling pattern |
---------------+------------------------+-------------------
random | packet position | packet counter,
n-out-of-N | sampling pattern | random numbers
| (random number list) |
---------------+------------------------+-------------------
uniform | sampling | random function
probabilistic | probability |
---------------+------------------------+-------------------
non-uniform |e.g. packet position | selection function,
probabilistic |or packet content(parts)| probability calc.
---------------+------------------------+-------------------
non-uniform |e.g. flow state | selection function,
flow-state |or packet content(parts)| probability calc.
---------------+------------------------+-------------------
mask/match | packet content(parts) | filter function
---------------+------------------------+-------------------
hash-based | packet content(parts) | hash function
---------------+------------------------+-------------------
router state | router state | router state
| | discovery
---------------+------------------------+-------------------
5.1 Information Model for Sampling Techniques
In this section we define the information models for most common
sampling techniques. Here the selection function is pre-defined
and given by the selector ID.
Sampler Description:
SELECTOR_ID
SELECTOR_TYPE
SELECTOR_PARAMETERS
OPERATING_TIME
ASSOCIATIONS
Where:
SELECTOR_ID:
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 18]
Internet Draft Techniques for IP Packet Selection October 2003
Unique ID for the packet sampler. The ID can be calculated under
consideration of the ASSOCIATIONS and a local ID.
SELECTOR_TYPE
Description: For sampling processes the SELECTOR TYPE defines
what sampling algorithm is used.
Values: Systematic count-based | Systematic time Based | Random
n-out-of-N | uniform probabilistic | non-uniform probabilistic |
non-uniform flow-state
SELECTOR_PARAMETERS
Description: For sampling processes the SELECTOR PARAMETERS
define the input parameters for the process. Interval length in
systematic sampling means, that all packets that arrive in this
interval are selected. The spacing parameter defines the spacing
in time or number of packets between the end of one sampling
interval and the start of the next succeeding interval.
Case n out of N:
- Population size N, Sample size n
Case Systematic Time Based:
- Interval length (in usec), Spacing (in usec)
Case Systematic Count Based:
- Interval length(in packets), Spacing (in packets)
Case uniform Probabilistic (with equal probability per packet):
- Sampling probability p
Case non-uniform probabilistic:
- Calculation function for sampling probability p
Case flow state:
- information reported for flow state can be found in
[MolFl03]
OPERATING_TIME
Description: The OPERATING_TIME parameter describes the
start/stop time of sampling process. List elements must not
overlap. The start time of the first element can be omitted,
meaning from now. The end time of the last element can be
omitted, meaning until sampler is removed.
Values: List of (Start time, End time)
ASSOCIATIONS
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 19]
Internet Draft Techniques for IP Packet Selection October 2003
Description: The ASSOCIATIONS field describes the observation
point and (possibly) the IPFIX processes to which the packet
selector can be associated.
The STREAM ID denotes the origin of the data stream that is
input to the selection function. It can be the observation point
directly or the ID of another selector. With this it is possible
to define combined schemes. If the STREAM ID contains IDs from
other selectors, one can derive the original observation point
from the selector definitions of these specified selectors.
Values: <STREAM ID, Metering process ID, Exporting process ID>
With STREAM ID: Observation point ID AND List of SELECTOR_IDs
5.2 Information Model for Filtering Techniques
In this section we define the information models for most common
filtering techniques. The information model structure closely
parallels the one presented for the sampling techniques.
Filter Description:
SELECTOR_ID
SELECTOR_TYPE
SELECTOR_PARAMETERS
OPERATING_TIME
ASSOCIATIONS
Where:
SELECTOR_ID:
Unique ID for the packet filter. The ID can be calculated under
consideration of the ASSOCIATIONS and a local ID.
SELECTOR_TYPE
Description: For filtering processes the SELECTOR TYPE defines
what filtering type is used.
Values: Matching | Hashing | Router_state
SELECTOR_PARAMETERS
Description: For filtering processes the SELECTOR PARAMETERS
define formally the common property of the packet being
filtered. For the filters of type Matching and Hashing the
definitions have a lot of points in common.
Values:
Case Matching
- <Header type = ip v4 >
- <bit specification, header part>
- <Selection interval specification, header part>
- <Header type = ipv6>
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 20]
Internet Draft Techniques for IP Packet Selection October 2003
- <bit specification, header part>
- <Selection interval specification, header part>
- <payload byte number N>
- <bit specification, payload part>
- <Selection interval specification, payload part>
Notes to Case Matching:
- The filter can be defined for the header part only, for the
payload part only or for both. In the latter case the
matching must be an AND of the two.
- The bit specification, for the header part, can be
specified for ipv4 or ipv6 only, or both
- In case of ipv4, the bit specification is a sequence of 20
Hexadecimal numbers [00,FF] specifying a 20 bytes bitmask
to be applied to the header
- In case of ipv6, it is a sequence of 40 Hexadecimal numbers
[00,FF] specifying a 40 bytes bitmask to be applied to the
header
- The bit specification, for the payload part, is a sequence
of Hexadecimal numbers [00,FF] specifying the bitmask to be
applied to the first N bytes of the payload, as specified
by the previous field. In case the Hexadecimal number
sequence is longer then N, only the first N numbers are
considered.
- In case the payload is shorter than N, the packet will not
match the filter Other options, like padding with zeros,
may be considered in the future.
- The selection interval specification is a list of non
overlapping intervals [intv_begin, intv_end] where
intv_begin, intv_end are bit strings of length 20*8 (ipv4
case), 40*8 (ipv6 case), N*8 (payload case).
- A filter cannot be defined on the options field of the ipv4
header, neither on stacked headers of ipv6.
- This specification doesnt preclude the future definition
of a high level syntax for defining in a concise way bit
selection and matching rules in a more human readable form
(e.g. TCP port in [2000,3000]). The requirement is that
such a syntax can be univoquely compiled into the one
defined above
Case Hashing:
- <Header type = ipv4>
- <Input bit specification, header part>
- <Header type = ipv6>
- <Input bit specification, header part>
- <payload byte number N>
- <Input bit specification, payload part>
- <additional hash input bits (seed)>
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 21]
Internet Draft Techniques for IP Packet Selection October 2003
- Hashing function specification
- Hash function name
- Length of input key (eliminate 0x bytes)
- Output value (length M and bitmask)
- Selection interval specification, as a list of non
overlapping intervals [start value, end value] where
value is in [0,2^M-1]
- Additional parameters dependent on specific hash
function
Notes to Case Hashing:
- On Input bit specifications fields, the same notes on bit
specifications of the Matching case reported above apply
- Some hash functions, their detailed functioning and their
specific parameter list are described in [NicMoDu].
Case Router State:
- Ingress interface at which packet arrives equals a
specified value
- Egress interface to which packet is routed to equals a
specified value
- Packet violated Access Control List (ACL) on the router
- Failed Reverse Path Forwarding (RPF)
- Failed Resource Reservation (RSVP)
- No route found for the packet
- Origin AS equals a specified value or lies within a given
range
- Destination AS equals a specified value or lies within a
given range
Note to Case Router State:
- All Router state entries can be linked by AND, OR, NOT
operators
OPERATING_TIME
Description: The OPERATING_TIME parameter describes the
start/stop time of filtering process. List elements must not
overlap. The start time of the first element can be omitted,
meaning from now. The end time of the last element can be
omitted, meaning until sampler is removed.
Values: List of (Start time, End time)
ASSOCIATIONS
Description: The ASSOCIATIONS field describes the observation
point and (possibly) the IPFIX processes to which the packet
selector can be associated. The STREAM ID denotes the origin of
the data stream that is input to the selection function. It can
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 22]
Internet Draft Techniques for IP Packet Selection October 2003
be the observation point directly or the ID of another selector.
With this it is possible to define combined schemes. If the
STREAM ID contains IDs from other selectors, one can derive the
original observation point from the selector definitions of
these specified selectors.
Values: STREAM ID, Metering process ID, Exporting process ID>
With STREAM ID: Observation point ID AND List of SELECTOR_IDs
6. Composite Techniques
Composite schemes are realized by using the STREAM ID in the
information models. The STREAM ID denotes from which selectors
the input stream originates. If multiple stream IDs are given,
this means that the selector operates on the packet stream
simply resulting from the time superposition of the output of
all the listed filters and samplers. Note that a sampler/filter
could be intermittently active, as defined in the OPERATING TIME
field.
Some examples of composite schemes are reported below.
6.1 Cascaded filtering->sampling or sampling->filtering
If a filter precedes a sampling process the role of filtering is
to create a set of parent populations from a single stream
that can then be fed independently to different sampling
functions, with different parameters tuned for the population
itself (e.g. if streams of different intensity result from
filtering, it may be good to have different sampling rates). If
filtering follows a sampling process, the same sampling rate and
type is applied to the whole stream, independently of the
relative size of the streams resulting from the filtering
function. Moreover, also packets not destined to be selected
will load the sampling function. So, in principle, filtering
before sampling allows a more accurate tuning of the sampling
procedure, but if filters are too complex to work at full line
rate (e.g. because they have to access router state
information), sampling before filtering may be a need.
6.2 Stratified Sampling
Stratified sampling is one example for using a composite
technique. The basic idea behind stratified sampling is to
increase the estimation accuracy by using a-priori information.
The a-priori information is used to perform an intelligent
grouping of the elements of the parent population. With this a
higher estimation accuracy can be achieved with the same sample
size.
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 23]
Internet Draft Techniques for IP Packet Selection October 2003
Stratified sampling divides the sampling process into multiple
steps. First, the elements of the parent population are grouped
into subsets in accordance to a given characteristic. This
grouping can be done in multiple steps. Then samples are taken
from each subset.
The stronger the correlation between the characteristic used to
divide the parent population and the characteristic of interest
(for which an estimate is sought after), the easier is the
consecutive sampling process and the higher is the
stratification gain. For instance if the dividing characteristic
were equal to the investigated characteristic, each element of
the sub-group would be a perfect representative of that
characteristic. In this case it would be sufficient to take one
arbitrary element out of each subgroup to get the actual
distribution of the characteristic in the parent population.
Therefore stratified sampling can reduce the costs for the
sampling process (i.e. the number of samples needed to achieve a
given level of confidence).
For stratified sampling one has to specify classification rules
for grouping the elements into subgroups and the sampling scheme
that is used within the subgroups. The classification rules can
be expressed by multiple filters. For the sampling scheme within
the subgroups the parameters have to be specified as described
above.
7. Security Considerations
Security threats can occur if the configuration of sampling
parameters or the communication of sampling parameters to the
application is corrupted. This document only describes sampling
schemes that can be used for packet selection. It neither
describes a mechanism how those parameters are configured nor
how these parameters are communicated to the application.
Therefore the security threats that originate from this kind of
communication cannot be assessed with the information given in
this document.
In some cases malicious users or attackers may be interested to
hide packets from the service provider. For instance if packet
selectors are used for accounting or intrusion detection
applications, users may want to prevent that packets are
selected. If a deterministic sampling scheme is used or a
selection scheme that takes packet content into account, the
user can shape or send packets in a way that they are less
likely to be selected. This has to be taken into account when
choosing an appropriate packet selection technique.
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 24]
Internet Draft Techniques for IP Packet Selection October 2003
8. References
[AmCa89] Paul D. Amer, Lillian N. Cassel: Management of
Sampled Real-Time Network Measurements, 14th
Conference on Local Computer Networks, October
1989, Minneapolis, pages 62-68, IEEE, 1989
[ClPB93] K.C. Claffy, George C Polyzos, Hans-Werner Braun:
Application of Sampling Methodologies to Network
Traffic Characterization, Proceedings of ACM
SIGCOMM'93, San Francisco, CA, USA, September 13 -
17, 1993
[CoGi98] I. Cozzani, S. Giordano: Traffic Sampling Methods
for end-to-end QoS Evaluation in Large
Heterogeneous Networks. Computer Networks and ISDN
Systems, 30 (16-18), September 1998.
[DuGG03] Nick Duffield, Albert Greenberg, Matthias
Grossglauser, Jennifer Rexford: A Framework for
Passive Packet Measurement, Internet Draft draft-
ietf-psamp-framework-01, work in progress, June
2003
[DuGr00] Nick Duffield, Matthias Grossglauser: Trajectory
Sampling for Direct Traffic Observation,
Proceedings of ACM SIGCOMM 2000, Stockholm, Sweden,
August 28 - September 1, 2000.
[EsVa01] C. Estan and G. Varghese, New Directions in
Traffic Measurement and Accounting, ACM SIGCOMM
Internet Measurement Workshop 2001, San Francisco
(CA) Nov. 2001
[HT52] D.G. Horvitz and D.J. Thompson, A Generalization of
Sampling without replacement from a Finite
Universe. J. Amer. Statist. Assoc. Vol. 47, pp.
663-685, 1952.
[JePP92] Jonathan Jedwab, Peter Phaal, Bob Pinna: Traffic
Estimation for the Largest Sources on a Network,
Using Packet Sampling with Limited Storage, HP
technical report, Managemenr, Mathematics and
Security Department, HP Laboratories, Bristol,
March 1992,
http://www.hpl.hp.com/techreports/92/HPL-92-35.html
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 25]
Internet Draft Techniques for IP Packet Selection October 2003
[Knuth98] Donald E. Knuth: The Art of Computer Programming,
Volume 3: Searching and Sorting, Addison Wesley,
1998
[MolFl03] M.Molina: Flow selection support in IPFIX, Internet
Draft <draft-molina-flow-selection-00.txt>, work in
progress, October 2003.
[Molina03] M.Molina: A scalable and efficient methodology for
flow monitoring in the internet, International
Teletraffic Congress (ITC-18), Berlin, Sep. 2003
[NicMoDu] S. Niccolini, M.Molina, N.Duffield: Hash functions
description for packet selection, Internet Draft
<draft-niccolini-hash-descr-00.txt>, work in
progress, October 2003.
[QuZC03] J. Quittek, T. Zseby, B. Claise, S. Zander:
Requirements for IP Flow Information Export,
Internet Draft <draft-ietf-ipfix-reqs-10.txt>, work
in progress, June 2002
[Zseb02] Tanja Zseby: Deployment of Sampling Methods for SLA
Validation with Non-Intrusive Measurements,
Proceedings of Passive and Active Measurement
Workshop (PAM 2002), Fort Collins, CO, USA, March
25-26, 2002
9. Author's Addresses
Tanja Zseby
Fraunhofer Institute for Open Communication Systems
Kaiserin-Augusta-Allee 31
10589 Berlin
Germany
Phone: +49-30-34 63 7153
Fax: +49-30-34 53 8153
Email: zseby@fokus.fhg.de
Maurizio Molina
NEC Europe Ltd., Network Laboratories
Adenauerplatz 6
69115 Heidelberg
Germany
Phone: +49 6221 90511-18
Email: molina@ccrle.nec.de
Fredric Raspall
NEC Europe Ltd., Network Laboratories
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 26]
Internet Draft Techniques for IP Packet Selection October 2003
Adenauerplatz 6
69115 Heidelberg
Germany
Phone: +49 6221 90511-31
EMail: raspall@ccrle.nec.de
Nick Duffield
AT&T Labs - Research
Room B-139
180 Park Ave
Florham Park NJ 07932, USA
Phone: +1 973-360-8726
Email: duffield@research.att.com
10. Intellectual Property Statement
AT&T Corporation may own intellectual property applicable to
this
contribution. The IETF has been notified of AT&T's licensing
intent for the specification contained in this document. See
http://www.ietf.org/ietf/IPR/ATT-GENERAL.txt for AT&T's IPR
statement.
11. Full Copyright Statement
Copyright (C) The Internet Society (2002). All Rights Reserved.
This document and translations of it may be copied and furnished
to others, and derivative works that comment on or otherwise
explain it or assist in its implementation may be prepared,
copied, published and distributed, in whole or in part, without
restriction of any kind, provided that the above copyright
notice and this paragraph are included on all such copies and
derivative works. However, this document itself may not be
modified in any way, such as by removing the copyright notice or
references to the Internet Society or other Internet
organizations, except as needed for the purpose of developing
Internet standards in which case the procedures for copyrights
defined in the Internet Standards process must be followed, or
as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not
be revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided
on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 27]
Internet Draft Techniques for IP Packet Selection October 2003
OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 28]