Message Organization Working Group T. Sirainen
Internet-Draft November 16, 2010
Intended status: Standards Track
Expires: May 20, 2011
IMAP4 Extension for Fuzzy Search
draft-ietf-morg-fuzzy-search-03
Abstract
This document describes an IMAP protocol extension enabling server to
perform searches with inexact matching and assigning relevancy scores
for matched messages.
Note
A revised version of this draft document will be submitted to the RFC
editor as a Proposed Standard for the Internet Community. Discussion
and suggestions for improvement are requested, and should be sent to
morg@ietf.org.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 20, 2011.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
Sirainen Expires May 20, 2011 [Page 1]
Internet-Draft IMAP4 FUZZY SEARCH November 2010
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
1. Conventions used in this document
In examples, "C:" indicates lines sent by a client that is connected
to a server. "S:" indicates lines sent by the server to the client.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [Kwds].
2. Introduction
When humans perform searches in IMAP clients, they typically want to
see the most relevant search results first. IMAP servers are able to
do this in the most efficient way when they're free to internally
decide how searches should match messages. This document describes a
new SEARCH=FUZZY extension that provides such functionality.
3. The FUZZY Search Key
FUZZY search key takes another search key as its argument. Server is
allowed to perform all matching in an implementation-defined manner
for this search key, including ignoring the active comparator as
defined by [RFC5255]. Typically this would be used to search for
strings, for example:
C: A1 SEARCH FUZZY (SUBJECT "IMAP break")
S: * SEARCH 1 5 10
S: A1 OK Search completed.
Besides matching messages with subject "IMAP break", the above search
may also match messages with subjects "broken IMAP", "IMAP is
broken", or anything else the server decides that might be a good
match.
This example does a fuzzy SUBJECT search, but a non-fuzzy FROM
search:
C: A2 SEARCH FUZZY SUBJECT work FROM user@example.com
S: * SEARCH 1 4
Sirainen Expires May 20, 2011 [Page 2]
Internet-Draft IMAP4 FUZZY SEARCH November 2010
S: A2 OK Search completed.
It is implementation-defined how server handles multiple separate
FUZZY search keys.
4. Relevancy Scores for Search Results
Servers SHOULD assign a search relevancy score for each matched
message when the FUZZY search key is given. Relevancy scores are
given in range 1-100, where 100 is the highest relevancy. The
relevancy scores SHOULD use the full 1-100 range, so that clients can
show them to users in a meaningful way, such as a percentage value.
As the name already tells, relevancy scores specify how relevant to
the search the matched message is. It's not necessarily the same as
how precisely the message matched. For example a message whose
subject matches fuzzily the search string might get a higher
relevancy score than a message whose body had the exact string in the
middle of a sentence. When multiple search keys are matched fuzzily,
it's server-dependent on how the relevancy score is calculated.
If server also advertises the ESEARCH capability as defined by
[ESEARCH], the relevancy scores can be retrieved using the new
RELEVANCY return option for SEARCH:
C: B1 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo"
S: * ESEARCH (TAG "B2") ALL 1,5,10 RELEVANCY (4 99 42)
S: B1 OK Search completed.
The RELEVANCY return option MUST NOT be used unless FUZZY search key
is also given. Note that SEARCH results aren't sorted by relevancy,
SORT is needed for that.
5. Fuzzy matching with non-string search keys
Fuzzy matching is not limited to just string matching. All search
keys SHOULD be matched fuzzily, although what exactly that means for
different search keys is left up to server implementations to decide
-- including deciding that fuzzy matching is meaningless for a
particular key, and falling back to exact matching. Some suggestions
are given below.
Dates: A typical example could be when a user wants to find a message
"from Dave about a week ago". A client could perform this search
using SEARCH FUZZY (FROM "Dave" SINCE 21-Jan-2009 BEFORE 24-Jan-
2009). Server could return messages outside the specified date
Sirainen Expires May 20, 2011 [Page 3]
Internet-Draft IMAP4 FUZZY SEARCH November 2010
range, but the further away the message is, the lower the relevancy
score.
Sizes: These should be handled similar to dates. If a user wants to
search for "about 1 MB attachments", the client could do this by
sending SEARCH FUZZY (LARGER 900000 SMALLER 1100000). Again the
further away the message size is from the specified range, the lower
the relevancy score.
Flags: Server could return messages that don't have the specified
flags, but with a lower relevancy score.
UIDs, sequences, modification sequences: These are examples of keys
for which exact matching is probably what makes sense.
Alternatively, a server might choose, for instance, to expand a UID
range by 5% on each side.
6. Extensions to SORT
If server also advertises the SORT capability as defined by [SORT],
the results can be sorted by the new RELEVANCY sort criteria:
C: C1 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo"
S: * SORT 5 10 1
S: C1 OK Sort completed.
The message with the highest score is returned first. As with
RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used
unless FUZZY search key is also given.
If server also advertises the ESORT capability as defined by
[CONTEXT], the relevancy scores can be retrieved using the new
RELEVANCY return option for SORT:
C: C2 SORT RETURN (RELEVANCY ALL) (RELEVANCY) UTF-8 FUZZY TEXT "Helo"
S: * ESEARCH (TAG "C2") ALL 5,10,1 RELEVANCY (99 42 4)
S: C2 OK Sort completed.
To limit the number of returned messages, use the PARTIAL return
option. For example this returns the 10 most relevant messages:
C: C3 SORT RETURN (PARTIAL 1:10) (RELEVANCY) UTF-8 FUZZY TEXT "World"
S: * ESEARCH (TAG "C3") PARTIAL (1:10 42,9,34,13,15,4,2,7,23,82)
S: C3 OK Sort completed.
Sirainen Expires May 20, 2011 [Page 4]
Internet-Draft IMAP4 FUZZY SEARCH November 2010
7. Formal Syntax
The following syntax specification uses the augmented Backus-Naur
Form (BNF) as described in [ABNF]. It includes definitions from
[RFC3501], [IMAP-ABNF] and [SORT].
capability =/ "SEARCH=FUZZY"
score = 1*3DIGIT
;; (1 <= n <= 100)
score-list = "(" [score *(SP score)] ")"
search-key =/ "FUZZY" SP search-key
search-return-data =/ "RELEVANCY" SP score-list
;; Conforms to <search-return-data>, from [IMAP-ABNF]
search-return-opt =/ "RELEVANCY"
;; Conforms to <search-return-opt>, from [IMAP-ABNF]
sort-key =/ "RELEVANCY"
8. Security Considerations
Implementation of this extension might enable a denial-of-service
attack if the implementation isn't careful to prevent them. Fuzzy
search engines are often complex with non-obvious disk space, memory
and/or CPU usage patterns. Implementors should test at least the
behavior of large messages that contain very long words and/or unique
random strings. Also very long search keys might cause excessive
memory or CPU usage.
Invalid input may also be problematic. For example if the search
engine takes UTF-8 stream as input, it might fail more or less badly
when illegal UTF-8 sequences are fed to it from a message whose
character set was claimed to be UTF-8. This could be avoided by
validating all the input and, for example, replacing illegal UTF-8
sequences with the Unicode replacement character (U+FFFD).
Search relevancy rankings might be susceptible to "poisoning" by
smart attackers using certain keywords or hidden markup (e.g. HTML)
in their messages to boost the rankings. This can't be fully
prevented by servers, so clients should prepare for it by at least
allowing user to see all the search results, rather than hide results
below a certain score.
Sirainen Expires May 20, 2011 [Page 5]
Internet-Draft IMAP4 FUZZY SEARCH November 2010
9. IANA Considerations
IMAP4 capabilities are registered by publishing a standards track or
IESG approved experimental RFC. The registry is currently located
at:
http://www.iana.org/assignments/imap4-capabilities
This document defines the X-DRAFT-I03-SEARCH=FUZZY [[anchor7: Note to
RFC Editor: fix before publication]] IMAP capability. IANA is
requested to add it to the registry.
10. Acknowledgements
Alexey Melnikov, Zoltan Ordogh, Barry Leiba, Cyrus Daboo and Dave
Cridland have helped with this document.
11. Normative References
[ABNF] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 5234, January 2008.
[CONTEXT] Cridland, D. and C. King, "Contexts for IMAP4", RFC 5267,
July 2008.
[ESEARCH] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH
Command for Controlling What Kind of Information Is
Returned", RFC 4731, November 2006.
[IMAP-ABNF]
Melnikov, A. and C. Daboo, "Collected Extensions to IMAP4
ABNF", RFC 4466, April 2006.
[Kwds] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", RFC 2119, March 1997.
[RFC3501] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION
4rev1", RFC 3501, March 2003.
[RFC5255] Newman, C., Gulbrandsen, A., and A. Melnikov, "Internet
Message Access Protocol Internationalization", RFC 5255,
June 2008.
[SORT] Crispin, M. and K. Murchison, "Internet Message Access
Sirainen Expires May 20, 2011 [Page 6]
Internet-Draft IMAP4 FUZZY SEARCH November 2010
Protocol - SORT and THREAD Extensions", RFC 5256,
June 2008.
Author's Address
Timo Sirainen
Email: tss@iki.fi
Sirainen Expires May 20, 2011 [Page 7]