Paper presented at Time and the Web, Staffordshire University, 19th June 1997.

An adaptive caching and replication mechanism for WWW

Cristian Ionitoiu
Computer Science Department
University of Warwick
Coventry, CV4 7AL, UK
chrisi@dcs.warwick.ac.uk
Maria Angi
Computer Science Department
"Politehnica" University of Timisoara
2 Vasile Parvan, 1900 Timisoara, Romania
maria@cs.utt.ro

Abstract

The paper presents a proposal for a new approach to the public WWW file caching, with the aim to heavily reduce their access latency. It introduces a cache domain as a neighbourhood for file accesses and it alternately contains a migration scheme for copies. It is fully aware of the DNS structure, it adapts well to the present resource deployment within Internet and it is fully user transparent.

1. Introduction

The introduction of WWW technology, by CERN researchers, has started a revolution on the Internet. Until WWW invention, one could consult information either non interactively using ftp/telnet [1] or interactively using gopher [2], but only for plain ASCII files. WWW has imposed a new dimension of both attractiveness and technicality to the Internet interactive access of documents. Statistics have shown the dramatically increasing rate of WWW sites, now WWW being the most used vehicle on Internet for remote access. This was, also, imposed by the URL [3] concept which permits remote access with different protocols within the same HTML [4] browser used usually for http [5] access.

Due to its rapid spreading and to Internet general growth the transfer operations have became very slow and the need for local caching has emerged. The W3 Consortium (W3C), the body supervising Web activity and development, has identified an active research area called "Propagation, Replication and Caching"[5]. The research within this area is supposed to propose solutions to the network bandwidth problem. W3C has issued a problem statement [5] which would guide those researchers interested in the problem. This statement contains: a short description of the problem, its design requirements, and some results and assessment of present solutions. As requirements W3C identifies: efficiency, response time, storage limitations and limited human intervention. It, also, states the key technologies recommended to build solutions to the problem: distributed replication and error recovery, multicast based replication, and cooperative caches.

There are several cache consistency mechanisms currently in use on the Internet for WWW technology: time-to-live (TTL) fields, client polling, and invalidation protocols.

The original WWW proxy [6], [7] was introduced by the CERN researchers into their version of the httpd daemon. It is based on a TTL parameter and on a special request of http protocol used to check if after TTL (expiration the file has been modified or not on the source.

Another solution is based on the idea of splitting the local cache into two caches, one for short term and the other one for long term caching [8]. Alex FTP cache [9] uses client polling to maintain cache consistency. Client polling is a technique where clients periodically check back with the server to determine if cached objects are still valid. In Alex the assumption that new files are modified more frequently than old files governs the policy of the polling algorithm.

Invalidation protocols [10] depend on the server keeping track of cached data; each time an item changes the server notifies the caches that their copies are no longer valid.

An important research project related to several aspects of the Internet technology is Harvest [11] started at the University of Colorado at Boulder. Harvest cache is a multilevel hierarchical cache mapped on the Domain Name System (DNS) tree structure. It introduces the notion of siblings of a certain cache and uses TTL approach.

2. An adaptive caching mechanism

It seems that hierarchical multilevel caching is the most interesting method if it is tuned to use the DNS structure. This explains Harvest success. But, Harvest performance could be further improved if the siblings concept is removed. It works well for small DNS but for large ones either the number of the generated messages is too big or the entire Harvest system has to be replicated on more sites.

The requirement to have a local copy for each file is too strong and if it is enforced caching performance cannot be raised over a certain low threshold. So, in order to obtain improved performance this requirement has to be weaken. Harvest weakens this requirement by introducing its hierarchical caches, but, on their turn, the caches operate for their neighbourhood as a local cache. Thus, the mechanism is only increasing the magnitude and does not scale well with the growth of the DNS. If one needs more caches, the entire Harvest system has to be replicated, as we have already said.


**** see postscript version of paper

Figure 1. Topology of cache domains, creation of a domain (a) and splitting of a domain in case of copy replication (b).


We introduce an algorithm that further weakens the locality concept and introduces a neighbourhood like concept called acache domain. It is fully mapped on the DNS structure and has nice tuneable properties. It is also completely user transparent and tackles all requirements identified by W3C.

To simplify the picture we consider one caching daemon for each DNS, but this correspondence could be one to many, as well. The idea is to have a copy of a file into the neighbourhood if it is not possible to have it in the local cache, and to preserve the copy into the area as much time as possible. We accept some delays during downloading operations but the delays will be small and they will depend only on the local traffic. The aim is to achieve a controlled migration into a certain area of the files as they are required. So, there are no local caches in the local DNS with the exception of the machine that runs the caching daemon.

By a cache domain (CD) we understand a DNS sub-tree within which resides only one copy of a certain remote file. The root node of the tree is considered the root node of the domain.There are two kinds of cache domains: public (PUCD) and private (PRCD). If the cache domain manages more than one file then it is called public, other wise it is called private. The PRCDs have a determined lifetime and the PUCDs could have both determined and infinite lifetime. The operation of the cache domains is based on different ways of treating the copy of a remote file. We introduce three different types of copies:

We consider a generic domain, like .ro domain, as the current cache domain. When a file is downloaded from a remote source a cache entry is constructed. This entry varies in content, but for a freshly brought file has the following information: f, file content existent into the cache directory, r, file record containing: file name, time-to-live (TTL), local lifetime (LLT), name of the downloading node, status flag, semantic information (mostly extracted from the file header), cache domain name, etc, t, downloading time, d, a duration indicating how much time the file will be kept into the cache or will be considered active/used in the current cache. This duration is dynamically calculated based on: file size (fs), hit count (hc), bandwidth (b), available cache space (cs). LLT is obtained by summation of the durations and represents the CD lifetime, this field is zero for PUCDs with infinite lifetime (permanent). After the file has been registered within the cache, the information about this operation is propagated upwards into the DNS structure of the cache domain (see Table 1). One can notice that only file record and downloading time are written into the other cache entries.


Table 1: Evolution of the file f into the .ro cache domain
ee.utt.ro utt.ro .ro
time < d0 (f,r,t,d0) (r,t) (r,t)
d0 <time < d1 (r,t) (f,r,t,d1) (r,t)
d1 <time < d2 (r,t) (r,t) (f,r,t,d2)

There are two approaches to achieve the mobility of the temporary copies. This mobility is important to obtain an overall uniform performance of the access of the single copy in the domain. One solution is to forcefully migrate the temporary copy upwards within the cache domain, and the other is to migrate it only on request if and only if it is not used/active in the current cache.

In the first case, if the current duration, recounted from the last hit, has elapsed the file content is transferred to the upper cache and its cache entry is modified, correspondingly (see Table 1). This operation is the core of the (controlled migration algorithm, which is given below where by CD we understand both PUCD and PRCD:


    if (current duration has finished)
        if (current node is the root of the cache domain)  {
            *delete cache entry completely
            *propagate deletion downwards within the cache domain }
        else {
            *transfer file content into the upper cache
            *modify cache entry accordingly
            *change status to not used and calculate new duration}

Figure 2. Algorithm FM (forceful migration).


One can notice that the algorithm is tuned very well to the present resource topology of Internet. The higher a certain organisation's DNS is situated in the DNS tree the more resources, especially external memory, the organisation manages. Thus, the fact that files tend to move upwardly is of less concern and fits well the cache size availability imposing a low pressure at the leaves of the DNS tree and a high pressure toward root. In the second case, the focus is on the LLT and the status of the temporary copy. A modified algorithm is given below:


    if (current duration (CurDur) has finished)
        if (LLT - CurDur <= 0) {  /* file's LLT has elapsed */
            * delete cache entry completely
            * propagate deletion in the entire cache domain }
        else  /* file still has time to live */
            * set status as not used/active

Figure 3. Algorithm NM (non-migratory).


The difference, beside migration, between FM and NM algorithms is the conditions that activate a deletion decision. If an user from ee.utt.ro (Figure 1) needs the file after the remote downloading operation a volatile or a temporary copy is transferred from the neighbouring node depending on the status flag. This will impose some delay, especially if volatile copies are downloaded repeatedly, but far smaller than it would have been needed to download the file from the remote source. The details of the downloading algorithm are given below:


        if (the file has an entry into the local cache)
            if (the file content exists into the local cache)
                * load the local copy
            else {
                *request owner and status from the upper hierarchy
                if (status is not used)  /* the copy has not been accessed
                                            since the last migration */
                    *download a temporary copy
                else
                    *download a volatile copy }
        else {  /* there is no entry into the local cache */
            * check with upper hierarchy to find out if the file is known
            if (file is known)  {
                *add entry into local cache
                *request owner and status from the upper hierarchy
                if (status is not used)
                    *download a temporary copy
                else
                    *download a volatile copy }
            else  {  /* file not known */
                *download file from the remote source
                *propagate information about the new file upwardly into the cache domain
        }   }

Figure 4. Algorithm D (downloading).


One can easily notice that the operation of the algorithm D is orthogonal with the operation of either algorithm FM or algorithm NM. During each access of the copy the TTL, from the cache entry, is checked against LLT. If LLT is greater than TTL then the copy is blocked while an updating operation takes place. When the file is transferred into the hierarchy we have the following relation

di =fi,fs, hc, b, TTL, di-1)
with the imposed condition di > di-1

The operation and the flexibility of the mechanism depend heavily on the above parameters. The mechanism calculates regularly some of the parameters for transfer in such a way to reflect the network status. In this way the migration process is controlled without human intervention. An empirical relation has to be found to express the dependency of duration on its parameters. This phase of the research could be tackled after the mechanism is implemented and several tests will be executed on different DNS structures.

However, human intervention is required in few instances to set up two thresholds. The transfer duration (d) threshold (MaxD) is the upper limit which if it is passed imposes a replication to take place. The system adminis- trator should decide regularly which is the upper limit of the duration for each cache domain under his jurisdiction. Or, s/he can impose some generic rules which can be automated.

Let write MaxDj as the limit for the j cache domain. Thus, if the current duration di does not verify the relation di < MaxDj then the invoking node takes the decision to download a permanent copy of the file (a replica) and to create a private cache domain having it as the root node. The reintegration of a cache domain is done automatically into the domain which has the parent node of the root of the former cache domain. One can notice that the algorithms depicted in the previous section are generic, so they are not influenced by the spawning and merging of cache domains.

3. Conclusions

In this paper a new adaptive mechanism for the caching and the replication of public WWW files has been present- ed. The paper identifies some challenging unsolved aspects which have to be addressed after the mechanism implementation. The mechanism considers all the requirements stated by W3C.

Acknowledgements

Cristian Ionitoiu is supported by a Soros Foundation and FCO scholarship.

References

[1] J. Postel, J. Reynolds, File transfer protocol,
Internet RFC 959, October 1985.
ftp://ds. internic.net/rfc/rfc959.txt
[2] F. Anklesaria et al,, The Internet Gopher protocol,
Internet RFC 1436, March 1993.
ftp://ds.internic.net/rfc/rfc1436.txt
[3] Uniform Resource Locator
http://www.w3.org/pub/WWW/Addressing/
[4] HyperText Markup Language: working and background material.
http://www.w3.org/pub/WWW/Markup/
[5] Propagation, Replication and Caching on the Web.
http://www.w3.org/pub/WWW/PropagationActivity.html
[6] Ari Luotonen, Kevin Altis, World-Wide Web Proxies,
WWW94 preliminary proceedings, May 1994.
http://www.cern.ch/PapersWWW94/luotonen.ps
[7] J. Gwertzman, M. Seltzer, World-Wide Web Cache Consistency.
http://www.eecs. harvard.edu/~vino/web/usenix.196/
[8] D. Wessels, Intellingent Caching for World-Wide Web Objects,
Proceedings of INET-95, 1995.
http://info.isoc.org:80/HMP/PAPER/139/abst.html
[9] V. Cate, Alex - A Global Filesystem,
Proceedings of the 1992 USENIX File System Workshop, Ann Arbor, MI., May 1992, pg. 1-12.
[10] K. Worell, Invalidation in Large Scale Network Object Caches,
Master's Thesis, University of Colorado, Boulder, 1994.
[11] C. M. Bowman, P.B. Danzig, D.R. Hardy, U. Manber, M.F. Schwartz, and D.P. Wessels,
Harvest: A scalable, customizable, discovery and access system",
Technical Report CU-CS-732-95, University of Colorado, Department of (Computer Science, Boulder, Colorado, March 1995.


Time and Web home page at: http://www.hiraeth.com/conf/web97/