INTERNET-DRAFT                                     N. Ballou (Microsoft)
Expires: December 1, 1997               B. Hernacki & B. Polk (Netscape)
<draft-ballou-nntpsrch-03.txt>                               May 1, 1997



                   NNTP Full-text Search Extension



1.  Status of this Memo

This  document is an Internet-Draft.   Internet-Drafts are working docu-
ments of the Internet Engineering Task Force (IETF),  its areas, and its
working groups.  Note that   other groups  may also  distribute  working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid   for a maximum of six  months
and may be updated,  replaced, or obsoleted   by other documents  at any
time.  It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as ``work in progress.''

To  learn the current status   of any  Internet-Draft, please check  the
``1id-abstracts.txt''  listing  contained in  the Internet-Drafts Shadow
Directories on ds.internic.net  (US East Coast), nic.nordu.net (Europe),
ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

2.  Abstract

This document describes a set of enhancements to the Network News
Transfer Protocol [NNTP-977] that allow full-text searching of news
articles in multiple newsgroups.  The proposed SEARCH command supports
functionality similar to the [IMAP4] SEARCH command, minus user-specific
search keys (i.e., ANSWERED, DRAFT, FLAGGED, KEYWORD, NEW, OLD, RECENT,
SEEN) and minus search keys based on headers that do not exist in news
(i.e., CC, BCC, TO).

The availability of the extensions described here will be advertised by
the server using the extension negotiation mechanism described in the
new NNTP protocol specification currently being developed [NNTP-NEW].















Ballou                                                          [Page 1]


INTERNET-DRAFT                                               May 1, 1997


3.  Introduction

The NNTP SEARCH command is sent from the client to the server to specify
and initiate a full-text  search on articles  in one or more newsgroups.
The NNTP SEARCH command is a subset of the [IMAP4] SEARCH command, with
user property and mail-specific header search keys not present in NNTP
SEARCH.  The result of an NNTP SEARCH is OVER data, as specified in
[NNTP-NEW], for each article that satisfies the search criteria.

In addition, the XPAT command is extended so that  it  can  be  used  to
full-text  search  articles within a single newsgroup.  Both the headers
and the body of the articles are searched.

3.1.  New and Enhanced NNTP Commands

There is one new NNTP command, two new options to the existing LIST
command, and an enhancement to one existing command.

*    SEARCH

*    LIST SRCHFIELDS

*    LIST SEARCHABLE

*    XPAT

The SEARCH command runs a one-time search, returning overview-like data.

The LIST SRCHFIELDS command returns the fields that the server allows in
full-text searches.

The LIST SEARCHABLE command allows the client to determine  which  news-
groups are full-text searchable.

The XPAT command allows the pseudo-header  ":TEXT".   This  specifies  a
full-text  (headers  and  body) search of the articles in a single news-
group.

4.  Use of NNTP Extension Mechanism

The NNTP extension mechanism allows a server to describe its capabili-
ties.  The following extensions are used to advertise the capabilities
defined in this document.

4.1.  SEARCH Extension

The SEARCH extension means that the server supports the  following  com-
mands: SEARCH, LIST SEARCHABLE, LIST SRCHFIELDS.


4.2.  XPATTEXT Extension

The XPATTEXT extension means that the server supports the  :TEXT  header
in the XPAT command, as described by this document.

5.  Command Descriptions

5.1.  SEARCH Command

Arguments: optional character set specification
           optional newsgroup specification
           searching criteria (one or more)

Responses: 224 overview information follows
           412 no newsgroup selected
           421 no articles match the search criteria
           462 error performing search
           501 command syntax error
           502 no permission

The SEARCH command searches one or more newsgroups for articles that
match the given searching criteria.  Searching criteria consist of one
or more search keys.  If there are articles that match the search
criteria, the server responds with code 224 and returns OVER data for
each matching article in a format similar to that described in
[NNTP-NEW].  The one change from the [NNTP-NEW] OVER format is in the
article number field, which must support searches over multiple
newsgroups.  In SEARCH OVER data this field uses the format
newsgroup:art-ID rather than just an article number as defined in
[NNTP-NEW].

A response of 421 indicates  that there are  no articles that match  the
search  criteria.  A  response  of 501 indicates a   syntax error in the
search  criteria.  A response  of 502 indicates   that the user does not
have permission to search one  or more of the  specified newsgroups.  If
the search criteria did not specify a newsgroup, and there is no current
newsgroup  (i.e.,  set using the NNTP   GROUP command), then  the server
returns  the error  code 412,   indicating that  no newsgroup has   been
specified.   A response of 462  indicates that the server encountered an
error when processing the search.
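
As an illustration only, a client might classify the status line of a
SEARCH response along the following lines.  The function name and the
choice to raise an exception for error codes are assumptions of this
sketch and not part of the protocol; the code meanings are those given
above.

```python
# Meanings of the SEARCH response codes described in this section.
SEARCH_RESPONSES = {
    224: "overview information follows",
    412: "no newsgroup selected",
    421: "no articles match the search criteria",
    462: "error performing search",
    501: "command syntax error",
    502: "no permission",
}


def check_search_status(status_line):
    """Return True if OVER data follows, False if nothing matched.

    Any other code is treated as an error (a choice made by this
    sketch, not mandated by the text).
    """
    code = int(status_line.split(None, 1)[0])
    if code == 224:
        return True
    if code == 421:
        return False
    meaning = SEARCH_RESPONSES.get(code, "unexpected response")
    raise RuntimeError("%d %s" % (code, meaning))
```

A caller would read data lines up to the terminating "." only when the
function returns True.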

When multiple keys  are specified, the result  is the  intersection (AND
function) of all the  messages that match  those keys.  For example, the
criteria FROM "SMITH" SINCE 1-Feb-1994 refers to all articles from Smith
that were placed in the newsgroup since February 1,  1994.  A search key
may also be a parenthesized list of one  or more search  keys (e.g.  for
use with the OR and NOT keys).
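
The way keys combine can be sketched as follows.  This is a minimal
illustration, not a server implementation: the nested-tuple
representation of parsed keys, the dictionary standing in for an
article, and all function names are hypothetical.  Only a few string
keys are handled, using the case-insensitive substring matching that
this section specifies for string keys.

```python
# Hypothetical in-memory article; the field names are illustrative only.
ARTICLE = {
    "from": '"John Smith" <JSmith@xyz.com>',
    "subject": "RE: object-oriented langs",
    "body": "Smalltalk and C++ compared.",
}


def contains(field_value, needle):
    """Case-insensitive substring test used for string keys."""
    return needle.lower() in field_value.lower()


def matches(article, key):
    """Evaluate one parsed search key (a nested tuple) against an article."""
    op = key[0]
    if op == "ALL":
        return True
    if op in ("FROM", "SUBJECT", "BODY"):
        return contains(article[op.lower()], key[1])
    if op == "NOT":
        return not matches(article, key[1])
    if op == "OR":
        return matches(article, key[1]) or matches(article, key[2])
    if op == "AND":  # a parenthesized list, or the top-level key list
        return all(matches(article, k) for k in key[1])
    raise ValueError("unsupported search key: %r" % (op,))


def search_matches(article, keys):
    """Top-level keys are implicitly ANDed together."""
    return matches(article, ("AND", keys))
```

For example, search_matches(ARTICLE, [("FROM", "SMITH"), ("SUBJECT",
"object")]) is true for the sample article because both keys match.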

Server  implementations  MAY exclude [MIME-1]  body  parts with terminal
content  types other than TEXT and  MESSAGE from consideration in SEARCH
matching.

The optional character set  specification consists of the word "CHARSET"
followed by a registered MIME character set.  It indicates the character
set of the strings that appear in the search criteria.  [MIME-2] strings
that   appear in  RFC 822/MIME  message   headers, and [MIME-1]  content
transfer  encodings,  MUST be decoded     before matching.  Except   for
US-ASCII, it    is not required  that  any  particular character  set be
supported.  If the server does  not support the specified character set,
a 462 error code is returned.

The optional newsgroup specification consists of the word "IN"  followed
by  either  a  wildcard  character  "*"  -  indicating a search over all
newsgroups  - or a list  of  newsgroup  names  separated  by commas.   A
newsgroup name can end with the wildcard string ".*" indicating a search
over  a  sub-hierarchy  of  the  newsgroup name  space.  If no newsgroup
specification  is  given,  the search is over the current newsgroup.  If
there is no current newsgroup, the server returns the 412 error code.
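
A server-side test of whether a given newsgroup falls within the
newsgroup specification might look like the sketch below.  The function
name, its arguments, and the use of LookupError to signal the 412 case
are assumptions made for illustration.  The text does not say whether
"comp.lang.*" also selects a group named exactly "comp.lang"; this
sketch assumes it does not.

```python
def group_selected(spec, group, current_group=None):
    """Return True if `group` falls within the newsgroup specification.

    `spec` is the text following "IN", or None when no specification
    was given, in which case only the current newsgroup is searched.
    """
    if spec is None:
        if current_group is None:
            raise LookupError("412 no newsgroup selected")
        return group == current_group
    if spec == "*":
        return True
    for name in spec.split(","):
        if name.endswith(".*"):
            # Sub-hierarchy: keep the trailing "." so that "comp.lang.*"
            # does not select "comp.language".
            if group.startswith(name[:-1]):
                return True
        elif group == name:
            return True
    return False
```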

In all search  keys that use strings,  a message matches  the key if the
string is a substring of the field.  The matching is case-insensitive.

The ON, BEFORE, and SINCE search criteria use the same  date as  used in
the NNTP NEWNEWS command - the date the article arrived  on  the server.
A server indicates support for the ON, BEFORE, and SINCE search criteria
by listing :Date in the LIST SRCHFIELDS response.

The defined   search keys are as  follows.   Refer to the  Formal Syntax
section for the precise syntactic definitions of the arguments.

      <message range> Articles with article numbers corresponding to the
                      specified range.

      ALL             All Articles in the current newsgroup; the default
                      initial key for ANDing.

      BEFORE <date>   Articles whose server arrival date is earlier than
                      the specified date.

      BODY <string>   Articles that contain the specified string in the
                      body of the message.

      FROM <string>   Articles that contain the specified string in the
                      article structure's FROM field.

      HEADER <field-name> <string>
                      Articles that have a header with the specified
                      field-name (as defined in [RFC-822]) whose
                      [RFC-822] field-body contains the specified
                      string.

      LARGER <n>     Articles with a size larger than the specified
                     number of octets.

      NOT <search-key>
                     Articles that do not match the specified search
                     key.



      ON <date>      Articles whose server arrival date is within the
                     specified date.

      OR <search-key1> <search-key2>
                     Articles that match either search key.

      SENTBEFORE <date>
                     Articles whose [RFC-822] Date: header is earlier
                     than the specified date.

      SENTON <date>  Articles whose [RFC-822] Date: header is within the
                     specified date.

      SENTSINCE <date>
                     Articles whose [RFC-822] Date: header is within or
                     later than the specified date.

      SINCE <date>   Articles whose server arrival date is within or
                     later than the specified date.

      SMALLER <n>    Articles with a size smaller than the specified
                     number of octets.

      SUBJECT <string>
                     Articles that contain the specified string in the
                     article structure's SUBJECT field.

      TEXT <string>  Articles that contain the specified string in the
                     header or body of the message.

   Example: C: SEARCH FROM "Smith" SINCE 1-Feb-1994
            S: 224 overview information follows
            S: comp.object:573 \t RE: object-oriented langs \t \
               "John Smith" <JSmith@xyz.com> \t Sun, 03 Nov 1996 \
               14:25:05 -0800 \t <01cbc9d5f3c70$eab9a2cd@xyz.com> \
               \t 4080 \t 33
            S: .

   Note: each field in OVER response is separated by a tab - shown as a
         \t in the example above.
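
A client could split one line of SEARCH output as sketched below.  The
function name is hypothetical.  The sketch assumes that art-ID is a
decimal article number, as in the example above, and that the newsgroup
name is everything before the last ":" in the first field.

```python
def parse_search_line(line):
    """Split one SEARCH response line into its tab-separated fields.

    Returns (newsgroup, article_number, remaining_fields).  The first
    field has the form newsgroup:art-ID rather than a bare number.
    """
    fields = line.split("\t")
    # Assumption: split on the last ":" of the first field only; later
    # fields such as the subject may contain ":" themselves.
    newsgroup, _, art_id = fields[0].rpartition(":")
    if not newsgroup or not art_id.isdigit():
        raise ValueError("malformed article ID field: %r" % (fields[0],))
    return newsgroup, int(art_id), fields[1:]
```

The remaining fields are returned untouched so that the caller can
interpret them according to the OVER format in [NNTP-NEW].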



5.1.1.  Search Formal Syntax

The search query syntax is derived from the search  syntax  defined  for
the  IMAP4 protocol.  It is somewhat different because of the way inter-
national character sets need to be encoded.

The following syntax specification uses the augmented Backus-Naur Form
(BNF) notation as specified in [RFC-822].

Except as   noted otherwise,  all    alphabetic characters   are   case-
insensitive.  The use of upper or  lower case characters to define token
strings is  for editorial  clarity  only.  Implementations   MUST accept
these strings in a case-insensitive fashion.

   astring         ::= atom / string

   atom            ::= 1*ATOM_CHAR

   ATOM_CHAR       ::= <any CHAR except atom_specials>

   atom_specials   ::= "(" / ")" / SPACE / CTL / "*" / quoted_specials

   CHAR            ::= <any 7-bit US-ASCII character except NUL,
                        0x01 - 0x7f>

   CTL             ::= <any ASCII control character and DEL,
                        0x00 - 0x1f, 0x7f>

   date            ::= date_text / <"> date_text <">

   date_day        ::= 1*2digit
                       ;; Day of month

   date_month      ::= "Jan" / "Feb" / "Mar" / "Apr" / "May" / "Jun" /
                       "Jul" / "Aug" / "Sep" / "Oct" / "Nov" / "Dec"

   date_text       ::= date_day "-" date_month "-" date_year

   date_year       ::= 4digit

   digit           ::= "0" / digit_nz

   digit_nz        ::= "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" /
                       "9"

   header_fld_name ::= sstring

   mstring         ::= <a [MIME-2] encoded string surrounded by double
                        quotes>

   newsgroup       ::= atom [ ".*"]

   newsgroups      ::= "*" / newsgroup_list

   newsgroup_list  ::= newsgroup [ ","  newsgroup_list]

   number          ::= 1*digit
                       ;; Unsigned 32-bit integer
                       ;; (0 <= n < 4,294,967,296)

   nz_number       ::= digit_nz *digit
                       ;; Non-zero unsigned 32-bit integer
                       ;; (0 < n < 4,294,967,296)

   QUOTED_CHAR     ::= <any TEXT_CHAR except quoted_specials> /
                       "\" quoted_specials

   quoted_specials ::= <"> / "\"

   range           ::= nz_number / nz_number "-" [ nz_number ]
                       ;; Identifies a range of Articles.

   search          ::= "SEARCH" SPACE ["CHARSET" SPACE astring SPACE]
                       ["IN" SPACE newsgroups SPACE]
                       1#search_key
                       ;; [CHARSET] MUST be registered with IANA

   search_key      ::= "ALL" / "BODY" SPACE sstring /
                       "FROM" SPACE sstring / "ON" SPACE date /
                       "SINCE" SPACE date / "BEFORE" SPACE date /
                       "SUBJECT" SPACE sstring / "TEXT" SPACE sstring /
                       "HEADER" SPACE header_fld_name SPACE sstring /
                       "LARGER" SPACE number / "NOT" SPACE search_key /
                       "OR" SPACE search_key SPACE search_key /
                       "SENTBEFORE" SPACE date / "SENTON" SPACE date /
                       "SENTSINCE" SPACE date / "SMALLER" SPACE number /
                       range / "(" 1#search_key ")"

   SPACE           ::= 1*<ASCII SP, space, 0x20>

   sstring         ::= astring / mstring

   string          ::= <"> *QUOTED_CHAR <">

   TEXT_CHAR       ::= <any CHAR except CR and LF>
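
As one worked instance of the grammar, the date production could be
parsed as in the following sketch.  The names are hypothetical, and
reporting a malformed date as a 501 syntax error is an assumption of
this sketch.  Month names are compared without regard to case, as the
grammar requires of alphabetic characters.

```python
import datetime
import re

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")

# date_text ::= date_day "-" date_month "-" date_year
DATE_RE = re.compile(r"([0-9]{1,2})-([A-Za-z]{3})-([0-9]{4})")


def parse_date(token):
    """Parse a `date` token, which may be surrounded by double quotes."""
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        token = token[1:-1]
    m = DATE_RE.fullmatch(token)
    if m is None or m.group(2).lower() not in MONTHS:
        raise ValueError("501 command syntax error: %r" % (token,))
    month = MONTHS.index(m.group(2).lower()) + 1
    # datetime.date raises ValueError for impossible dates such as
    # 31-Feb-1994.
    return datetime.date(int(m.group(3)), month, int(m.group(1)))
```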

5.2.  LIST SRCHFIELDS Command

Arguments: none

Responses: 224 data follows

The LIST SRCHFIELDS command returns a list of the fields that can be
specified in full-text search queries on the server.  The response is
a list of searchable fields, one per line.  A "." on its own line
terminates the list.  The fields are either article headers or
non-header fields supported by the query syntax.

The three currently defined non-header fields are ":Body", ":Text",  and
":Date".  ":Text"  means  all  the  searchable  text in the article, and
indicates  that  the  "text"  keyword  is  supported in the search query
language.  ":Body" means the body of the article, excluding the headers,
and  indicates  that the "body" keyword is supported in the search query
language.  ":Date"  means  the  date  at  which  an article arrived on a
server  -  similar  to  the  date used in the NNTP NEWNEWS command - and
indicates that the "ON", "SINCE", and "BEFORE" keywords are supported in
the search query language.

The "date", "text" and "body" search query fields are optional, but  the
server  must  indicate  whether  they  are  supported or not in the LIST
SRCHFIELDS response.

   Example: C: LIST SRCHFIELDS
            S: 224 Data follows.
            S: From
            S: Date
            S: Subject
            S: :Text
            S: .
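
A client might turn the LIST SRCHFIELDS data lines into the set of
usable search keys as sketched below.  The function name is
hypothetical, the input is assumed to be the lines following the 224
status line with line endings removed, and comparing the non-header
field names without regard to case is an assumption.

```python
def supported_keywords(response_lines):
    """Map LIST SRCHFIELDS data lines to the search keys they enable.

    Returns (keywords, headers): the keys enabled by non-header fields,
    and the list of searchable article headers.
    """
    pseudo = {
        ":text": {"TEXT"},
        ":body": {"BODY"},
        ":date": {"ON", "SINCE", "BEFORE"},
    }
    keywords, headers = set(), []
    for line in response_lines:
        if line == ".":  # terminating line
            break
        if line.startswith(":"):
            keywords |= pseudo.get(line.lower(), set())
        else:
            headers.append(line)  # an ordinary article header
    return keywords, headers
```

Applied to the example above, the result is the key set {"TEXT"} and
the headers From, Date, and Subject; the ON, SINCE, and BEFORE keys are
not available because ":Date" is not listed.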


5.3.  LIST SEARCHABLE Command

Arguments: none

Responses: 224 data follows

The LIST SEARCHABLE command returns a list of strings that define which
newsgroups are being indexed by the news server and are thus available
for searching.  In addition, the character sets allowed for each group
are returned.

When there are indexed newsgroups, the server returns 224, followed by
each portion of the tree that is indexed.  If all groups are indexed, a
line with "*" is returned.  If only some parts of the newsgroup
hierarchy are indexed, they are identified in the form
<indexed-hierarchy>.*.  Clients should not assume that these will always
be top-level hierarchies.  A "." on its own line terminates the list.

The character sets allowed in full-text searches for each entry are also
returned.  The character sets are identified by name, as defined in
[MIME-1].

   Example: C: LIST SEARCHABLE
            S: 224 Data follows.
            S: alt.* US-ASCII
            S: comp.lang.* US-ASCII ISO-8859-1 ISO-8859-2
            S: mcom.* ISO-8859-1
            S: .
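
The response could be consumed by a client roughly as follows.  Both
function names are hypothetical.  The sketch assumes that the hierarchy
pattern and the character set names on each line are separated by
white space, as in the example, and that the first matching entry
applies when entries overlap; the text does not address overlap.

```python
def parse_searchable(response_lines):
    """Return {pattern: [charset, ...]} from LIST SEARCHABLE data lines."""
    indexed = {}
    for line in response_lines:
        if line == ".":  # terminating line
            break
        if not line.strip():
            continue
        pattern, *charsets = line.split()
        indexed[pattern] = charsets
    return indexed


def charsets_for(indexed, group):
    """Character sets usable when searching `group`, or None if the
    group is not covered by any indexed hierarchy."""
    for pattern, charsets in indexed.items():
        if pattern == "*":
            return charsets
        if pattern.endswith(".*") and group.startswith(pattern[:-1]):
            return charsets
    return None
```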

5.4.  XPAT Command Enhancement

Arguments: header range|<message-id> pat [pat...]

Responses: <same as XPAT - see [NNTP-NEW]>

The XPAT command is enhanced in a simple way: the new value ":TEXT" is
supported as a header when invoking the command.  The :TEXT header
requests a full-text search of the body and all headers of the specified
articles.

When :TEXT is specified for the header, only a single "pat" is  allowed,
and  it  must  be  a  word or quoted string to search for, rather than a
wildmat pattern as allowed otherwise.

If :TEXT isn't specified as the header, the response is the same  as  it
always  has  been for XPAT, with each result line containing the article
number and the value of the header that matched the pattern.

If the :TEXT header is specified, the constant string "TEXT" is returned
in place of the value of the header that matched the pattern.

  Example: C: XPAT :TEXT 1000-2000 searchtext
           S: 221 Header follows
           S: 1021 TEXT
           S: 1024 TEXT
           S: .
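
A client-side sketch of this exchange is given below.  The function
names are hypothetical.  Quoting a multi-word search string with the
escaping rules of the SEARCH grammar's string production is an
assumption, since the text says only that a quoted string is allowed.

```python
def xpat_text_command(first, last, phrase):
    """Build an XPAT :TEXT command for article numbers first..last."""
    if " " in phrase or "\t" in phrase:
        # Assumption: quote and escape as in the `string` production
        # of the SEARCH grammar.
        escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
        phrase = '"%s"' % escaped
    return "XPAT :TEXT %d-%d %s" % (first, last, phrase)


def parse_xpat_text(response_lines):
    """Return the article numbers from the data lines of the response."""
    numbers = []
    for line in response_lines:
        if line.strip() == ".":  # terminating line
            break
        number, _, value = line.partition(" ")
        if value != "TEXT":
            raise ValueError("unexpected XPAT :TEXT line: %r" % (line,))
        numbers.append(int(number))
    return numbers
```

Because the value column is always the constant "TEXT", only the
article numbers carry information in the response.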

6.  Security Considerations

The search commands must be implemented in a way  that  does  not  allow
access  to  articles in newsgroups that a client is otherwise restricted
from reading due to access control rules.

7.  Bibliography

[NNTP-977]
     Network News Transfer Protocol.  B. Kantor, P. Lapsley, Request
     for Comment (RFC) 977, February 1986.

[NNTP-NEW]
     Network News Transfer Protocol.  S. Barber, Internet-Draft,
     September 1996.

[IMAP4]
     IMAP4 INTERNET MESSAGE ACCESS PROTOCOL - VERSION 4.  M. Crispin,
     Request for Comment (RFC) 1730, December 1994.

[MIME-1]
     Borenstein N., and N.  Freed, MIME (Multipurpose Internet Mail
     Extensions) Part One: Mechanisms for Specifying and Describing the
     Format of Internet Message Bodies, RFC 1521, Bellcore, Innosoft,
     September 1993.

[MIME-2]
     Moore, K., MIME (Multipurpose Internet Mail Extensions) Part Two:
     Message Header Extensions for Non-ASCII Text, RFC 1522, University
     of Tennessee, September 1993.

[RFC-822]
     Crocker, D., Standard for the Format of ARPA Internet Text
     Messages, STD 11, RFC 822, University of Delaware, August 1982.


8.  Author's Address

   Nat Ballou
   Microsoft
   One Microsoft Way
   Redmond, WA 98052
   USA

   Phone: +1 206-703-0574
   Email: natba@microsoft.com


                  This Internet Draft expires December 1, 1997.