Caching Mechanisms in Layered DNS Search Services. October 2002


Shi Xiaohong             Expires - April 2003                 [Page 1]

Internet Draft                                             Xiaohong SHI
                                                              Karen LIU
Document: draft-xhshi-dns-search-caching-00.txt             Inter China
                                                       Network Software
                                                               Co., Ltd
Expires: April 2003                                        October 2002


             Caching Mechanisms in Layered DNS Search Services

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice
   Copyright (C) The Internet Society (2002).  All Rights Reserved.


Table of Contents
1.  Abstract
2.  Conventions used in this document
3.  Searching query distribution in Keyword Services
4.  Caching mechanisms used in Keyword services
5.  Future consideration of caching mechanisms
6.  Conclusion
7.  References
8.  Acknowlegement
9.  Authors' Addresses


1. Abstract

John Klensin proposed a multi-layered search-based access model for the
DNS to address more human-friendly naming and navigation need of the
modern Internet [DNSSEARCH]. When deploying the DNS search system in the
global Internet, stability and scalability are two most important
factors no matter what protocol we finally use. This draft demonstrates
that caching is a significant objective for a stable and scalable DNS
search system. Particularly, this document analyzes statistical
operational data collected from our Keyword service. The collected data
clearly demonstrates the potential high impact of caching in the
resolution process. Based on deployment experience, the document
describes alternative caching methods. From there, basic caching
requirements for DNS Search systems are discussed.

The goal of this document is not to propose some standard caching
mechanisms or methods but to emphasize the necessity of caching for DNS
Search systems. Furthermore, the authors believe that the proven high
efficiency of caching should be an important consideration when
designing the future lookup protocol for such systems.


2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119 ].


3. Searching query distribution in Keyword Services

Searching and navigation are the mostly common used behavior on the
Internet. To facilitate the search on the Internet, there are kinds of
similar services implemented, such as Keyword service, specialized
search service, yellow page services, and free-text search engines,
corresponding to the different layers of the DNS-search framework. Since
[DNSSEARCH] is to establish a multi-layered identifying and searching
system, the user query characteristics of some existing search services
will obviously influence the design of the whole system.

Based on the analysis of practical access log of 3721's Keyword service
and some free-text search engines, it is found that the distribution of
the search items complies with the well-known 80%-20% distribution rule,
i.e., more than 80% of search traffic arises from less than 20% search
items(or even more extreme than 80:20).

Statistic data of 3721's keyword service can explain what is above
discussed. The table below is the search frequency distribution collected
from the log files of the searched items generated from the address bar
of web browsers during July 20th, 2002 to July 26th, 2002.

All the query keywords from the log files are filtered through a
normalization process, thus figure out the equivalent keyword regardless
of the format. The normalization process includes striping quotes,
removing undesired characters such as control characters, turning
multiple spaces into single space, lowering casing, and converting TC to
SC[TC/SC], etc.(Note that the average total searched items per day does
not count in the keyword which is only searched once in a day)

    +---------------------------------------------------------------+
    |                  | Search traffic / day |  item%  | traffic%  |
    +---------------------------------------------------------------+
    |   Top5000 items  |       14184304       |  5.73%  |   89.17%  |
    +---------------------------------------------------------------+
    |   Top1000 items  |       12711044       |  1.15%  |   79.91%  |
    +---------------------------------------------------------------+
    |   Top500 items   |       12021089       |  0.57%  |   75.57%  |
    +---------------------------------------------------------------+
    |   Top100 items   |       10302625       |  0.11%  |   64.77%  |
    +---------------------------------------------------------------+
    |   Average total searched items / day    |        87205        |
    +---------------------------------------------------------------+
    |   Average total search traffic / day    |       15906821      |
    +---------------------------------------------------------------+

>From the above table, we can know that nearly 80% of the search traffic
is due to less than 2% of the query words. Besides, for free-text search
engine, we can also get the search item distribution because 3721's
keyword service has been integrated into almost all the portal and local
portal search engines in China. We get the same distribution and we
believe that this discipline will also be effective in the future.

So the authors believe that there will be the similar situation when
implementing a DNS-searching system. Thus, whichever protocol we finally
use, caching will be highly efficient and necessary.


4. Caching mechanisms used in Keyword services

This section takes 3721 as an example to show some caching mechanisms
used in keyword services, which will be possible options for future DNS-
searching systems.

4.1 Server side caching mechanisms

In implementation of the name resolving or data retrieving software on
server side, some mature cache methods already exists.

(a) In-memory data storage or index

A very important goal of the resolution procee is to eliminate network
delay and name server load from most requests by answering them from its
cache of results. For conventional relational database system, even for
exact match query, once the traffic or load exceeds some threshold,
the overall performance of the system will decline sharply. To guarantee
the high performance as well as the fuzzy search function, the resolving
system of 3721's keyword service takes advantage of in-memory database.
Almost all keywords and indexes based on which the exact and fuzzy match
can be conducted are stored in memory. This kind of in-memory data
storage and index structure can also be regarded as a caching mechanism.
In fact, a resolving server configured with 1G byte memory can store 1
million keywords and indexes, so most of the 28 million queries per day
of 3721's keyword service are finished in memory, which guarantees the
high performance of the system.

(b) Middle cache or proxy server

We can also setup cache or proxy servers between server and client. Many
search engines take the similar way. Besides, some mature caching methods
in other areas can also be used, e.g., using independent software or
hardware for cache, such as the special caching devices of EMC, Network
Appliance or CacheFlow.

In the global Internet environment, when we have very huge data volume,
we probably need more than one server to form a cluster for data
partitioning, and we may also need several clusters for data mirroring.
Under such circumstance, the cache server can also act as the inverse
proxy inside or among the clusters, to fulfill the caching function
together with other goals, such as client requests dispatching, load
balance and fault tolerance.

4.2 Client side caching mechanisms

The typical caching method used in client software is to store the
recently query results in cache. When a user query is matched with one
record in the cache, then it is directly returned as the result, which
will reduce the network traffic for the server. The records stored in
client cache will be refreshed based on TTL(Time-to-live) set by the
server.


Cache size and refresh frequency can both influence the hit rate of
cache. In 3721's keyword client, these two parameters are carefully
designed. Based on the analysis of access log of keyword resolve server
and analysis of the client buffer, we find that the 80%-20% discipline
is also effective for individual user. So we can use an appropriated
size for client cache, e.g.,200 to 300 keyword records. Of course, the
best size of client cache will vary with the popularity or query
frequency of the service. Server should be able to know such change
and can upgrade or notify the client to adjust the size.

For keyword service, because registered names are rarely changed, the
TTL value can be set to a longer time, e.g., several days. Particularly,
the TTL value could be set according to the frequency of changes of a
registration record across the registry. For example, we could measure
the time between changes to a database record (keyword, context, URI
change), and plot that time distribution across the database. The curve
will be a Gaussian curve with a mean "MEAN" and standard deviation
"SIGMA" (or we will regress the curve to that). Then, we could set the
TTL to: MEAN ¿C N * SIGMA (N>3), which would say that the probability
that the cache is hit when the record has changed is pretty low. However,
sometime we should ignore the TTL value and enforce the cache to refresh.
In 3721's keyword client, some special way which is triggered by server
will make part or all cached records expired.


5. Future consideration of caching mechanisms

The future DNS-searching system will be more complicated, various kinds
of services will exist in the same or different layers and they may
inter-operate with each other via a variety of protocols. Cache mechanism
will also be more complex. However, in the author's opinion, there at
least two basic requirements.

(a) caching for perfect or exact match result

To cache the individual result for lookup function. This requirement is
discussed above.

(b) caching for search result list

John Klensin pointed out in [DNSSEARCH] that DNS-searching system should
not only support the lookup function, i.e., one-to-one match, but also
support fuzzy match function which is necessary under some condition even
for layer 2 services. So the cache mechanism should include caching the
fuzzy search result, i.e., the record sets and the corresponding queries
should be cached. Sometimes, the user selection in the list should also
be cached to eliminate the need for further user interference. The design
for caching record sets is somewhat more complicated. The TTL value for a
record set will possibly be smaller than individually cached record, and
the TTL value for user selection in the record set is difficult to set.
It is up to different applications.

For fuzzy match result caching, it may be possible to consider introducing
a string normalization function that is much broader than the one for
perfect match. For example, in English, the fuzzy normalization function
would normalize "movies" into "movie" so that the two search queries
("movie" and "movies") correspond to the same "fuzzy normalized" string
("movie"). This is definitely a different normalization function than
the one we would use for perfect-match (the perfect match normalization
function would identify "movies" and "movie" as two different queries).
Note that we could introduce as many normalization function (fuzzy level
1, fuzzy level 2, etcí¡), depending on how fuzzy we want to be. For each
normalization function, we would have a distinct cache. Each cache would
map a normalized query to the set of records that are identical to that
normalize query, according to that normalization function. That way, the
cache structure does not change, only the normalization function does.


6. Conclusion

Based on the analysis of keyword usage statistic data in China, this
draft discussed the necessity of cache mechanism in DNS search system
proposed by John Klensin. The draft also lists some caching methods
currently used in Chinese keyword services. The validity of these
methods also demonstrate that implementing caching methodology in
DNS-searching will have a good trade-off to make the whole system stable
and scalable while keeping the protocol simple.


7. Reference

[RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997

[DNSSEARCH]  John C. Klensin, "A Search-based access model for the DNS",
             draft-klensin-dns-search-03.txt, May 2001,

[TC/SC]      Xiaodong Lee & al. Traditional and Simplified Chinese
             Conversion, draft-ietf-idn-tsconv-02.txt.


8. Acknowlegement

Discussions with a number of people have led to this document. Especially
some important suggestions have come from conversations with Nicolas Popp,
John Klensin, Patrik Faltstrom, Hongyi Zhou, Shu Cao and Jinqiang Gao.

9. Author's Address

Xiaohong SHI
Inter China Network Software CO., Ltd.
RM.406 He Qiao North Tower, 8A Guanghua Rd., Chaoyang Dist. Beijing,
P.R.China
100026
Phone: +86 10 65812445 ext. 625
E-mail: sxh@3721.com

Karen LIU
Inter China Network Software CO., Ltd.
RM.406 He Qiao North Tower, 8A Guanghua Rd., Chaoyang Dist. Beijing,
P.R.China
100026
Phone: +86 10 65812445 ext. 619
E-mail: karen@3721.com