INTERNET-DRAFT Carsten Bormann
Expires: September 1998 Universitaet Bremen TZI
March 1998
Network News Distribution Protocol: Architecture and Design Guidelines
draft-bormann-mnnp-nndp-00.txt
Status of this memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''
To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).
Distribution of this document is unlimited.
Abstract
This document describes an architecture and a set of protocols for
distributing Netnews [RFC0977, RFC1036] via IP multicast enabled
networks. The architecture is designed to be useful in the global
Internet. In particular, it allows multiple news servers to
cooperate on multicasting each new article only once. To facilitate
scalability to tens of thousands of news servers, it also provides
for receive-only multicast participants (that continue to send
articles via conventional NNTP).
This document is a submission to the IETF MNNP working group.
Comments are solicited and should be addressed to the working group's
mailing list at ietf-mnnp@va.pubnix.com and/or the author.
1. Introduction
Netnews (or Usenet news) is one of the more important systems for
electronic communication that make up what is now loosely called
``the Internet'' in the media. Usenet operates by flood-distributing
messages called articles between participating systems, called news
servers. Like any other element of the thriving Internet
environment, Usenet is experiencing growth problems.
Bormann [Page 1]
INTERNET-DRAFT NNDP: Architecture and Design Guidelines March 1998
It is widely recognized that NNTP, the article distribution system in
use in the Usenet, is running into scaling problems. Some ISPs
report that NNTP contributes between 7 and 12 % of their backbone
traffic -- this for a data stream that is less than 64 kbit/s in
total (see below).
As Usenet is fundamentally a multicasting system, an obvious approach
is to apply the emerging Internet network layer multicasting
technology to Usenet distribution. One experiment described in the
literature, MUSE [firehose paper], transmitted Usenet articles as UDP
multicast packets between participating sites. While this experiment
was moderately successful, it suffered from packet loss problems
(the chance of receiving an article intact falls off exponentially
with the number of fragments generated from it). Also, a scalable
security architecture was not
defined for this experiment.
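The effect of fragmentation on loss can be illustrated with a small calculation. The sketch below assumes that packets are lost independently with a fixed per-packet probability; this is a simplification that is not claimed by the experiment, and the function name is purely illustrative.

```python
def article_delivery_probability(packet_loss: float, fragments: int) -> float:
    """Probability that every fragment of an article arrives.

    Assumes independent losses with the same probability per packet
    (an assumption of this sketch, not a measured property).
    """
    if not 0.0 <= packet_loss <= 1.0:
        raise ValueError("packet_loss must be between 0 and 1")
    if fragments < 1:
        raise ValueError("fragments must be at least 1")
    return (1.0 - packet_loss) ** fragments


if __name__ == "__main__":
    # Compare a single-packet article with larger, fragmented ones.
    for n in (1, 10, 50):
        print(n, article_delivery_probability(0.02, n))
```

Without any error control, an article is lost as soon as one of its fragments is lost, which is why large articles fare much worse than small ones.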
This document defines an architecture and sketches two protocols to
make network layer multicasting more useful for news distribution.
The architecture will, in reference to an earlier experiment
[newscaster], be called Newscaster-2 or simply Newscaster; the two
protocols will be called NNDP (Network News Distribution Protocol)
and NNDCP (Network News Distribution Coordination Protocol),
respectively.
1.1. Benefits of multicasting Netnews
Distributing Netnews via network layer multicast provides a number of
benefits. For ISPs, Newscaster can help to significantly reduce the
backbone NNTP load: Each article traverses each link (in the best
case) only once instead of traversing the backbone links multiple
times, once to each target news server.
One other benefit of Newscaster will be reduced article propagation
times -- while current NNTP servers can be very fast, Newscaster
replaces multiple unicast hops between news servers by a single
multicast hop. As propagation times currently measure on the order
of hours, a reduction to the order of minutes would be a nice
achievement; a reduction below that (to seconds) is, however, not
intended. (As a side benefit, Newscaster will reduce the link
bandwidth consumed by a leaf news receiver by using batching and
compression and by reducing the NNTP/TCP/IP overhead incurred per
article.)
1.2. Basic Assumptions
This document makes a number of assumptions about the basic technical
parameters of the Netnews system. We assume that a few hundred
thousand new articles are to be distributed per day, i.e., one to a
few articles per second. We also assume that the
total volume of those articles is on the order of hundreds of
megabytes per day, i.e., tens to a few hundreds of kbit/s.
Newscaster-2 is scalable beyond those numbers, but not infinitely so.
[In particular, ``similar'' problems with different technical
parameters (such as live stock price feeds) are not necessarily
supported as efficiently as the actual worldwide Netnews system;
solving such similar problems is explicitly a non-goal of the
architecture.]
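The conversion from daily totals to per-second rates in the assumptions above can be checked as follows. The sample inputs (300,000 articles and 500 megabytes per day) are illustrative values chosen from within the stated ranges, not measurements; a megabyte is taken as 10**6 bytes and a kbit as 1000 bits.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def articles_per_second(articles_per_day: float) -> float:
    """Average article arrival rate for a given daily article count."""
    return articles_per_day / SECONDS_PER_DAY


def kbit_per_second(megabytes_per_day: float) -> float:
    """Average bit rate in kbit/s for a given daily volume.

    Uses decimal units (1 MB = 10**6 bytes, 1 kbit = 1000 bits),
    which is an assumption of this sketch.
    """
    bits_per_day = megabytes_per_day * 1_000_000 * 8
    return bits_per_day / 1000 / SECONDS_PER_DAY


if __name__ == "__main__":
    print(articles_per_second(300_000))  # a few articles per second
    print(kbit_per_second(500))          # tens of kbit/s
```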
In addition, we assume that the concept of News servers that receive
a full feed of news articles continues to be useful. On-demand
retrieval of news articles from neighboring servers is an interesting
concept but outside the scope of this document. We believe that most
News servers will want to receive most of the articles in the Netnews
system; Newscaster does not support elaborate mechanisms to receive a
specific subset of articles that covers exactly the newsgroups to
which a News server is ``subscribed''. (Newscaster does support
partitioning the global news-feed into a few general subsets, such as
alt.* and comp.*/sci.*.)
One very important point in the design of a multicast Netnews
distribution system is that, even if it takes off quickly, News
server administrators will not simply turn off their existing, well-
understood and robust system of NNTP feeds. To make a feature out of
what could be considered a bug, the Newscaster system is intended to
work with and be supplementary to the NNTP system. Newscaster-based
news servers continue to speak NNTP to neighboring systems, using
NNTP as a background scheme to fill in articles that they might have
missed in the multicast distribution. Therefore, Newscaster can be a
much more light-weight protocol, as it need not be 100 % reliable.
1.3. The multiple-entry problem
Given that Newscaster is not replacing, but supplementing NNTP, and
that the Newscaster system will for a long time be only a subset of
the global Netnews system, the two distribution mechanisms need to
cooperate. The most significant problem here is that a single news
article may be flood-distributed from its source via NNTP and reach
multiple Newscaster systems at about the same time (observations in
the live network show that this now often happens for multiple
well-connected news servers within a second). In a multicast
scenario, there is no way to ask all the receivers whether they have
already received an article; without further mechanisms, Newscasters
would therefore regularly send multiple redundant copies of a single
article.
This document proposes a coordination protocol between Newscaster
systems to decide which Newscaster system distributes a particular
article. The coordination protocol is separate from the distribution
protocol; receive-only sites need not be involved in the coordination
protocol. Note that correctness of the coordination protocol is not
a prerequisite to correctness of the overall system, only to its
efficiency, i.e., an occasional slip (multiple transmission of one
article) is tolerable.
2. The Newscaster Architecture
2.1. Protocols
Newscaster assumes an underlying IP multicast network such as the
experimental Mbone and/or the operational IP multicast networks being
deployed by many ISPs. The multicast network is assumed to be able
to sustain a rate-controlled low-bandwidth stream of packets for
extended periods; the only form of congestion control envisaged is
that receivers can drop out if they experience consistent congestion.
To achieve a degree of performance in the presence of losses in the
experimental Mbone, some form of error control is required. To
achieve good scalability without router support, the distribution
protocol only uses forward error correction; as news servers gain
multicast connectivity, they can simply start listening to the feed
without having to send any (unicast or multicast) data.
The coordination protocol does not need to be as scalable as the
distribution protocol: It will be difficult, if not impossible, to
coordinate between a few tens of thousands of news servers, and
various features of
the distribution protocol (batching, compression, digital signatures)
argue for limiting the number of active Newscaster servers. We
assume that new articles travel via NNTP to the nearest active
Newscaster system and are multicast from there to the rest of the
world.
Annex A defines a preliminary coordination protocol based on a
multicast transport protocol called MTP-2. (This protocol is a
version of MTP (RFC1301) that was developed further to be more useful
in WANs. It allows multicasting a sequence of arbitrary size
messages, each of which can consist of one or more multicast packets.
The MTP-2 protocol provides a global sequencing of the messages, as
well as global rate control.)
Other coordination protocols may be defined. Passive, receive-only
Newscaster systems need not be aware of the coordination protocol
being used -- they only need to understand the distribution protocol.
In particular, the distribution protocol can be used from a single
source to a local (e.g., per-ISP) set of receivers; the coordination
protocol then becomes trivial.
2.2. Operation of active Newscasters
A news server actively participating in the Newscaster system is
simply called a Newscaster. The set of cooperating Newscasters is
called the Newscaster Web. The entire Web is a single news system
from the point of view of RFC1036 Path headers. For the global
Newscaster Web, the name of the news system as it occurs in the Path
header is "newscaster-2.mcast.net". Additional local Newscaster Webs
can be created, if needed, under different names.
Each Newscaster checks whether each article it receives via NNTP or
other means already contains a Newscaster Path header entry and, if
so, immediately removes it from further consideration in the
Newscaster Web (in the INN implementation of the Netnews protocols,
this is done automatically if the outgoing link is identified by the
Web name, e.g., "newscaster-2.mcast.net").
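The Path check described above might be sketched as follows. The function names are hypothetical, and the sketch assumes that Path entries are separated by "!" only; RFC 1036 permits other punctuation as separators, so a real implementation would need a more careful parser.

```python
WEB_NAME = "newscaster-2.mcast.net"  # global Web name given in this document


def path_entries(path_header: str) -> list[str]:
    """Split the value of a Path header into its entries.

    Simplification: only "!" is treated as a separator.
    """
    return [e.strip().lower() for e in path_header.split("!") if e.strip()]


def already_in_web(path_header: str, web_name: str = WEB_NAME) -> bool:
    """True if the article has already passed through the Newscaster Web."""
    return web_name.lower() in path_entries(path_header)
```

An article for which already_in_web() returns True would be dropped from further Newscaster processing, but would still be handled by NNTP as usual.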
Those articles that do not contain a Newscaster Path header entry are
then prepared for being multicast into the Web. Several such
articles will generally be sent together as a batch. The
coordination protocol is used to decide, for each article, whether it
is actually this Newscaster which will distribute the article. At
the service interface, an implementation of a coordination protocol
receives a set of message-ids (a tentative batch) as input and
returns a (possibly empty) subset of the message-ids to be sent in an
actual batch. In general, each Newscaster should have only one set
of articles in progress with the coordination protocol at any point
in time. Further articles arriving during processing by the
coordination protocol should be collected for a future tentative
batch. Also, Newscasters should wait a few seconds for further
articles to arrive before submitting a new batch to the coordination
protocol.
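The service interface described above (a tentative batch of message-ids in, a possibly empty subset out) could be sketched as below. The class and function names are hypothetical; the coordination protocol is represented by a plain callable, and the waiting period of a few seconds before submission is left to the caller.

```python
from typing import Callable, List

# A coordination protocol, seen from the service interface: it takes a
# tentative batch of message-ids and returns the subset to be sent.
Coordinator = Callable[[List[str]], List[str]]


def trivial_coordinator(tentative: List[str]) -> List[str]:
    """Single-source case: this Newscaster sends everything."""
    return list(tentative)


class Batcher:
    def __init__(self, coordinate: Coordinator):
        self.coordinate = coordinate
        self.pending: List[str] = []  # collected for a future tentative batch
        self.in_progress = False      # at most one set under coordination

    def article_arrived(self, message_id: str) -> None:
        """Remember a newly received article for the next tentative batch."""
        if message_id not in self.pending:
            self.pending.append(message_id)

    def submit(self) -> List[str]:
        """Submit the pending message-ids and return the actual batch."""
        if self.in_progress or not self.pending:
            return []
        self.in_progress = True
        tentative, self.pending = self.pending, []
        try:
            return self.coordinate(tentative)
        finally:
            self.in_progress = False
```

Passing trivial_coordinator corresponds to the single-source case, in which no coordination between Newscasters is required.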
Actual batches are then formed out of the selected articles according
to RFC 1036, section 4.3. They are then compressed using the gzip
format (RFC1952) and digitally signed (see below). Finally, they are
distributed using the distribution protocol.
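Batch formation and compression might look like the following sketch. It uses the "#! rnews <size>" batch format of RFC 1036, section 4.3, and the Python standard gzip module, which produces the RFC 1952 format. The function names are hypothetical, and signing is left out because the key system is still an open issue.

```python
import gzip


def build_batch(articles: list[bytes]) -> bytes:
    """Concatenate articles, each preceded by a '#! rnews <size>' line."""
    parts = []
    for article in articles:
        parts.append(b"#! rnews %d\n" % len(article))
        parts.append(article)
    return b"".join(parts)


def compress_batch(batch: bytes) -> bytes:
    """Compress a batch into the gzip file format (RFC 1952)."""
    return gzip.compress(batch)
```

The compressed batch is what would then be signed and handed to the distribution protocol as its payload.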
2.3. Security
Any system that transports Netnews must provide some basic security
against spoofing attacks. Since the multicasting system itself
provides only very limited assurances that a source address is
correct, we resort to cryptographic measures.
Simple shared-secret authentication is not scalable -- in a
production version, thousands of News server administrators would
have to be in possession of the key. Instead, a public key system is
used, based on a web-of-trust security policy.
In the current NNTP system, each news server administrator trusts its
neighbor news server administrators to institute a good local usage
policy and to respond to incidents in a manner that helps to preserve
the integrity of the news system. The transitive closure of this web
of trust equals the actual connectivity of the news system. If a
news administrator misbehaves, he runs the risk of being
disconnected.
The Newscaster security policy attempts to mimic this existing policy
by cryptographic means. Instead of creating NNTP links to
``neighboring'' systems, a news administrator creates certificates
for all the Newscasters that she trusts. These certificates are
regularly distributed in a newsgroup that is reserved for this
purpose (such as news.config.newscaster), ensuring they can be
received even by sites that are not yet in possession of all the
certificates. Every receive-only system has to trust one or more
sites (e.g., the Newscaster equivalent of a ``well-connected site'')
to root its certificate chain. If a receiver of a Newscaster batch
does not find a certificate chain that verifies the signature of the
batch, it discards the batch.
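Finding a certificate chain is essentially a reachability search in the web of trust. The sketch below shows only that search; it performs no cryptographic verification whatsoever, and the data model (a mapping from issuer to the set of Newscasters it has certified) is an assumption made for illustration.

```python
from collections import deque


def has_chain(certs: dict[str, set[str]], roots: set[str], signer: str) -> bool:
    """True if 'signer' is reachable from a locally trusted root.

    'certs' maps an issuer to the Newscasters it has certified.  A real
    receiver would also verify the signature on every certificate and
    on the batch itself; that step is omitted here.
    """
    seen = set(roots)
    queue = deque(roots)
    while queue:
        issuer = queue.popleft()
        if issuer == signer:
            return True
        for subject in certs.get(issuer, ()):
            if subject not in seen:
                seen.add(subject)
                queue.append(subject)
    return False
```

A receiver would discard a batch whenever has_chain() returns False for the Newscaster that signed it.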
* Issue *: What type of key system and digital signature is used?
Newscaster should provide relatively fast signature checking with
modest, but (due to batching) not necessarily stellar signing
performance. The author would tend to use RFC1991 type (PGP)
formats, using RSA and MD5.
3. NNDP: The distribution protocol
The NNDP distribution protocol is used to distribute payloads to all
receivers. Payloads will generally range from small to a few dozen
kilobytes, but may be much larger in case a large article needs to be
transferred. The job of the distribution protocol is to:
- partition the payload into packets that can be multicast without
being fragmented on the way. We assume an Internet-wide MTU of
1280 (based on the IPv6 MTU) and save 80 bytes for header
overhead (IP, UDP, other), leaving 1200 bytes for the
distribution protocol data.
- add forward error correction. We use Vandermonde matrices as
implemented by Luigi Rizzo
[http://www.iet.unipi.it/~luigi/vdm.tgz]. The amount of error
correction to be added is a system parameter: For small batches,
we always add at least one FEC packet. For larger batches, the
FEC overhead is defined by a constant expansion factor. (This
factor could be chosen to match the TCP equation at the rate
intended.) For very large batches, the batch is split into
units which are independently subjected to FEC (packets from all
units of a batch are interleaved to spread out the
transmission).
- multicast the data at a defined rate (leaky bucket model). It
is the job of the coordination protocol to assign a rate to each
batch to be sent. (The rate should be relatively low to space
out the packets, allowing FEC to work around burst losses.)
- enable reassembly/erasure processing at the receiver. The
batches are tagged by a unique, 80-bit global ID, which is
assigned by the coordination protocol (e.g., global source
ID/sequence number). (Note that reassembly errors are not
catastrophic, as an incorrectly reassembled batch will be
rejected at signature check.) Each packet carries a total batch
size, a unit number within the batch, a packet number within the
unit, and the number of packets to be sent per unit (N).
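The packetization and FEC sizing steps listed above can be sketched as a planning function. The expansion factor (1.25) and unit size (128 KB) are hypothetical values, since both are open issues in this document, and the 1200 bytes are treated here as pure payload. The sketch only computes packet counts; it does not implement the Vandermonde code itself.

```python
import math

PACKET_DATA = 1200        # bytes per packet, from the MTU assumption above
EXPANSION = 1.25          # hypothetical FEC expansion factor
UNIT_BYTES = 128 * 1024   # hypothetical unit size


def plan_batch(batch_size: int) -> list[tuple[int, int]]:
    """Return (data_packets, total_packets) for each unit of a batch.

    total_packets corresponds to N; at least one FEC packet is always
    added, and larger units grow by the constant expansion factor.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    units = []
    remaining = batch_size
    while remaining > 0:
        unit = min(remaining, UNIT_BYTES)
        k = math.ceil(unit / PACKET_DATA)
        n = max(k + 1, math.ceil(k * EXPANSION))
        units.append((k, n))
        remaining -= unit
    return units


def packet_interval(rate_bits_per_s: float, packet_bytes: int = PACKET_DATA) -> float:
    """Seconds between packets when sending at a constant rate."""
    return packet_bytes * 8 / rate_bits_per_s
```

For example, a 12000-byte batch yields a single unit of 10 data packets and 13 packets in total under these assumed parameters.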
distribution protocol packet layout
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           global ID                           |
+                                                               +
|                                                               |
+                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |               N               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            pkt idx            |           unit idx            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       total batch size                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             rate                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             data                              |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
(For a discussion of the rate parameter, see NNDCP below.)
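The packet header could be encoded as in the sketch below. It assumes network byte order and a 16-bit width each for the packet index and unit index; the 80-bit global ID and the 16-bit N follow from the text and the layout above. The unit of the rate field is not defined in this document, so the value is treated as an opaque 32-bit integer.

```python
import struct

# global ID (10 bytes), N, packet index, unit index (16 bits each),
# total batch size and rate (32 bits each), network byte order.
# The index widths are an assumption of this sketch.
HEADER = struct.Struct("!10sHHHII")


def pack_header(global_id: bytes, n: int, pkt_idx: int, unit_idx: int,
                batch_size: int, rate: int) -> bytes:
    """Build the fixed-size header that precedes the packet data."""
    if len(global_id) != 10:
        raise ValueError("global ID must be exactly 80 bits (10 bytes)")
    return HEADER.pack(global_id, n, pkt_idx, unit_idx, batch_size, rate)


def unpack_header(packet: bytes) -> tuple:
    """Return (global_id, n, pkt_idx, unit_idx, batch_size, rate)."""
    return HEADER.unpack_from(packet)
```

Under these assumptions the header occupies 24 bytes, which would leave 1176 bytes of data per packet if the header counts against the 1200-byte budget.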
* Issue *: What is a good unit size? E.g., 128 KB? Should we
actually use the TCP equivalence equation to compute an expansion
factor from the rate?
4. Acknowledgments
This document has been prompted by the discussions in the MNNP BOF at
the Washington IETF. In particular, the author would like to thank
Joe Malcolm for the thought-provoking discussions at this IETF.
5. References
TBD
6. Addresses
6.1. Working Group
[The MNNP working group is in creation.]
6.2. Author's address
Carsten Bormann
Universitaet Bremen FB3 TZI
Postfach 330440
D-28334 Bremen, GERMANY
cabo@tzi.org
phone +49.421.218-7024
fax +49.421.218-7000
7. Annex A: MTP-2 based coordination protocol
When a batch is being prepared, a short MTP-2 message (an
announcement) is sent that just contains the message IDs of the
articles in the batch. When this message has been transmitted in the
MTP-2 Web and all lower-numbered messages have arrived, the
Newscaster removes those articles from the batch that have been
announced in lower-numbered announcements. This, in the steady state
case, makes it unlikely that two Newscasters will be transmitting the
same article concurrently. However, Newscasters that return after a
multicast outage would start to transmit old articles (that they have
received via NNTP while other systems got them via Newscaster). To
minimize the impact of such late-comers on the Newscast efficiency,
Newscasters only newscast articles they have newly received while
being active in the Web (i.e., no spooling).
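The filtering step described above can be sketched as follows. The function name and data layout are hypothetical; the sketch assumes that all lower-numbered announcements have already arrived and ignores wrap-around of the sequence number.

```python
def filter_batch(own_seq: int, own_ids: list[str],
                 announcements: dict[int, list[str]]) -> list[str]:
    """Drop message-ids already announced under a lower sequence number.

    'announcements' maps an MTP-2 sequence number to the message-ids
    carried in that announcement.
    """
    claimed = set()
    for seq, ids in announcements.items():
        if seq < own_seq:
            claimed.update(ids)
    return [mid for mid in own_ids if mid not in claimed]
```

Announcements with a higher sequence number are ignored on purpose: their senders are the ones expected to drop the overlapping articles.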
For IPv4, the global ID of a batch is composed of the concatenation
of the IP address of the MTP-2 master at the time of receiving the
announcement and the 24-bit MTP-2 sequence number, padded with zero
bits at the end to fill the 80-bit field.
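The composition of the global ID for IPv4 might be sketched as below: 32 bits of address, 24 bits of sequence number, and 24 zero bits to fill the 80-bit field. Big-endian byte order for the sequence number is an assumption of this sketch, and the function name is hypothetical.

```python
import ipaddress


def make_global_id(master_ipv4: str, sequence: int) -> bytes:
    """Build the 80-bit global ID from master address and sequence number."""
    if not 0 <= sequence < 2 ** 24:
        raise ValueError("sequence number must fit in 24 bits")
    address = ipaddress.IPv4Address(master_ipv4).packed  # 4 bytes
    # 3 bytes of sequence number, then 3 zero bytes of fill.
    return address + sequence.to_bytes(3, "big") + bytes(3)
```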
Rate control is performed in the following way: Each Newscaster is
aware of the total system rate defined for the Web (e.g., 128
kbit/s). Newscasters that are transmitting batches share this
bandwidth by setting up short-term reservations. Each Newscaster
also maintains a running idea of all the reservations currently in
effect. Upon reception of an announcement, the receiving Newscaster
considers half the unreserved system rate to be reserved for the
announcer. This reservation is corrected by the actual rate used by
the sender, once an NNDP packet is received for this batch (rate
field). The sender of a batch is allowed to use up to half of what
it considers to be the unreserved rate at the time it receives its
own announcement for this batch. Each Newscaster deletes a
reservation for a batch once the sender should have stopped sending
data, according to its actual chosen rate and the size of the batch
as indicated in the NNDP packets, or (if no NNDP packets were
received at all), after a timeout of T_SEND (T_SEND is initially set
to 15 seconds). Newscasters avoid using silly rates (i.e., less than
a very small fraction of the system rate for a large batch).
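The reservation bookkeeping described above could be sketched as follows. The class and method names are hypothetical; timing (the end of a transmission and the T_SEND timeout) is left to the caller, which would invoke release() at the appropriate moment.

```python
class ReservationTable:
    """A Newscaster's running view of the rate reservations in effect."""

    def __init__(self, system_rate: float):
        self.system_rate = system_rate           # e.g., 128 (kbit/s)
        self.reserved: dict[str, float] = {}     # batch id -> reserved rate

    def unreserved(self) -> float:
        """Share of the system rate not covered by any reservation."""
        return max(0.0, self.system_rate - sum(self.reserved.values()))

    def on_announcement(self, batch_id: str) -> float:
        """Tentatively reserve half of the unreserved rate for the announcer."""
        rate = self.unreserved() / 2
        self.reserved[batch_id] = rate
        return rate

    def on_nndp_packet(self, batch_id: str, actual_rate: float) -> None:
        """Correct the reservation to the rate carried in the NNDP packet."""
        self.reserved[batch_id] = actual_rate

    def release(self, batch_id: str) -> None:
        """Delete a reservation (transmission finished or T_SEND expired)."""
        self.reserved.pop(batch_id, None)
```

With a system rate of 128 kbit/s, two successive announcements are granted 64 and 32 kbit/s; if the first sender then reports an actual rate of 16 kbit/s, 80 kbit/s become unreserved again.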
8. Annex B: Newscasters: Active vs. Passive
Given that there are tens of thousands of news servers in operation,
and that NNDCP is intended to work between maybe a thousand active
Newscasters, the question immediately arises as to which news
servers should be active Newscasters and which should only listen to
the global Netnews distribution. In essence, this is of course a
judgment call, which may be guided by:
- Multicast connectivity. An active Newscaster obviously needs to
be able to source multicast traffic, not just receive it. Given
the current tendency of ISPs to charge extra for multicast
sourcing, many news servers may not want to become active
Newscasters.
- Path lengths. While the Newscaster architecture takes out many
hops from the Netnews distribution paths, an article needs to
traverse NNTP hops up to the first active Newscaster before it
can be efficiently multicast to the rest of the world. Often, a
(topological) region will want to maintain at least one active
Newscaster to minimize those path lengths.
- Maintaining the web of trust. Maintainers of active Newscasters
need to actively work on maintaining their position in the web
of trust that is used as the security foundation of Newscaster.