The Disk Proxy

Table of Contents

ABSTRACT
1 INTRODUCTION
2 VIRTUAL DISK PROXY
2.1 VIRTUAL DISK CACHE
2.1.2 CACHE HIERARCHY
3 PROXY SERVER
3.1 WHAT IS PROXY SERVER?
3.2 THE VIRTUAL DISK SERVER
4 JAVA VIRTUAL MACHINE
5 CONTENT DISTRIBUTED NETWORK (CDN)
6 STORE PROXY
6.1 WHAT IS STORE?
6.2 ORIGINAL STORE LAYOUT
7 CONCLUSION
8 BIBLIOGRAPHY
APPENDIX - POWER POINT SLIDES

ABSTRACT

The Virtual Disk Proxy (VD Proxy) is implemented together with the Virtual Disk
Cache (VDC). The VDC stores objects that receive a large fraction of requests
and uses approximations for storing the other objects. A cache hierarchy is used
to fetch objects from neighboring caches or parents. The VD Proxy handles all
network communication with the VD Server. It provides an interface to the Store
for read and write operations on pages. It opens a socket connection to the VDC
and writes commands indicating read and write operations.

Chapter 1
Introduction

The Virtual Disk Proxy is implemented in C code and linked in with the
JVM. The Virtual Disk Proxy handles all the network communication with the
Virtual Disk Server. It provides an interface to the Store by allowing read and
write operations on pages. It hides all the networking code that actually talks to
the Virtual Disk Cache (VDC). The Virtual Disk Proxy just opens a single socket
connection to the Virtual Disk Cache, writes commands indicating read and
write operations, and reads the responses from the network socket. Each VDC
reads a file containing a list of hosts that are also running VDCs. The Virtual
Disk Server is implemented in Java and runs as a separate Java application.

To avoid unnecessary waiting and network traffic, the Virtual Disk Proxy is
implemented together with a disk cache and a proxy server.

Figure 1: The Virtual Disk Proxy is situated between the Virtual Disk Server
and the Java Virtual Machine

Chapter 2

The Virtual Disk Proxy

The PLaVa Java Virtual Machine (JVM) was modified to use an interface called
“Store Proxy”, which performs the necessary mapping from objects to the
physical storage. It uses a new module called the Virtual Disk Proxy to read from
and write to the Virtual Disk. The Virtual Disk Proxy handles all the network
communication with the Virtual Disk Server. See Figure 1.

2.1) Virtual Disk Cache

The Virtual Disk Cache talks to the Virtual Disk Proxy module on one end
and the Virtual Disk Server on the other (see Figure 1). To keep track of what
pages are being cached, each cache instance maintains a hash table with an entry
for each page. The page entry indicates whether the current instance owns the
page. If it does, that cache instance has exclusive write access to the page. If
the cache instance is not the owner, it may still be able to read from the page. All
readers of a page are guaranteed to receive an invalidation request before the page
gets changed. When the caches are started, one of the caches is specified as the
initial owner of all pages. Whenever a cache needs to read or write a page it
contacts the probable owner of that page. The contacted cache will either give up
the page (if it is the owner), or forward the request to what it thinks is the owner.
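
The forwarding scheme described above can be sketched as follows. This is only an illustration, not the actual cache code: the class name CacheInstance, the methods hint, owns and acquire, and the choice to count contacted caches are all assumptions made for the example. Each instance keeps a hash table of probable owners, and a cache that never heard about a page assumes the initial owner still holds it.

```python
class CacheInstance:
    """Illustrative cache instance that tracks a probable owner per page."""

    def __init__(self, name, initial_owner=None):
        self.name = name
        # One cache is designated the initial owner of all pages.
        self.initial_owner = initial_owner if initial_owner is not None else self
        # Hash table of page entries: page number -> probable owner.
        self.probable_owner = {}

    def hint(self, page):
        """Return the cache this instance believes owns the page."""
        return self.probable_owner.get(page, self.initial_owner)

    def owns(self, page):
        return self.hint(page) is self

    def acquire(self, page):
        """Take ownership of a page; return how many caches were contacted."""
        contacted = 0
        target = self.hint(page)
        while target is not self:
            contacted += 1
            if target.owns(page):
                # The owner gives up the page and remembers the new owner.
                target.probable_owner[page] = self
                break
            # Not the owner: forward to whoever it thinks is the owner.
            target = target.hint(page)
        self.probable_owner[page] = self
        return contacted
```

With three instances a (initial owner), b and c, a request from c issued after b has taken the page is first sent to a and then forwarded to b, so two caches are contacted.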

Making ownership specific to each page may not be the best choice.
DynamO [1,9] uses a hierarchical ownership model, which may have
performance advantages.

The Virtual Disk Cache will request a page from the Virtual Disk Server if
there is no page entry for the page. When pages are written to, the Virtual Disk
Cache will lazily write the copies back to the Server. Writes to the Server are
asynchronous. The cache instance builds a queue of outstanding write requests
and empties the queue as the acknowledgements arrive from the server. When the
server fails to respond, a thread that runs periodically forces a resend of the
write requests.
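
A minimal sketch of such a queue of outstanding writes is given below. The names WriteBackQueue, enqueue, acknowledge and resend_overdue are hypothetical, the send callable stands in for whatever code transmits a write request to the server, and the timeout value is an assumption. The periodic thread is not shown; it would simply call resend_overdue at intervals.

```python
import time
from collections import OrderedDict


class WriteBackQueue:
    """Illustrative queue of write requests awaiting acknowledgement."""

    def __init__(self, send, resend_after=1.0):
        self.send = send                  # callable(page, data); assumed transport
        self.resend_after = resend_after  # seconds before a write is resent
        self.pending = OrderedDict()      # page -> (data, time last sent)

    def enqueue(self, page, data, now=None):
        """Send a write request without waiting for the reply."""
        now = time.monotonic() if now is None else now
        self.pending[page] = (data, now)
        self.send(page, data)

    def acknowledge(self, page):
        """Remove a write from the queue once the server has acknowledged it."""
        self.pending.pop(page, None)

    def resend_overdue(self, now=None):
        """Resend every write that has gone unacknowledged for too long."""
        now = time.monotonic() if now is None else now
        resent = []
        for page, (data, sent_at) in list(self.pending.items()):
            if now - sent_at >= self.resend_after:
                self.send(page, data)
                self.pending[page] = (data, now)
                resent.append(page)
        return resent
```

Passing the current time explicitly (the now parameter) is a design choice for the sketch only; it makes the resend logic easy to exercise without real delays.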

The cache instances use spin locks on pages that are currently being written
to or read from to prevent corruption of data.

By studying several Internet caches and workloads, we derive four basic
design principles for large-scale distributed caches:

(1) Minimize the number of hops to locate and access data,
(2) Do not slow down misses,
(3) Share data among many caches, and
(4) Cache data close to clients.

A client sends a request to a cache, and if the cache contains the requested
data, it replies with the data. Otherwise, the cache may ask its neighbors for
the data; if none of the neighbors possesses the data, the cache sends the
request to its parent. This process continues recursively up the hierarchy until
the data is located or the root cache fetches the data from the server specified
in the request. The caches then send the data down the hierarchy to the client,
and each cache along the path caches the data.

Each VDC (usually running on a separate host) reads an initialization file
called virtual_disk_cache.rc, which has a list of hosts that are also running
VDCs. A sample configuration file would look like this:
svetlana 20001 default
katja 20001
tamara 28999
tatyana 28999
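
Reading a file in this format could look like the sketch below. The function name parse_cache_rc is hypothetical, and the interpretation of each line as "host, port, optional extra fields" is an assumption based only on the sample; the meaning of the word default is not stated in the text, so it is kept as an uninterpreted field.

```python
def parse_cache_rc(text):
    """Parse 'host port [extra fields]' lines into (host, port, extras) tuples."""
    peers = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        host = fields[0]
        port = int(fields[1])
        extras = fields[2:]  # e.g. ['default']; meaning not specified here
        peers.append((host, port, extras))
    return peers


SAMPLE = """svetlana 20001 default
katja 20001
tamara 28999
tatyana 28999
"""

peers = parse_cache_rc(SAMPLE)
```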

The Virtual Disk Proxy is implemented in C code and linked in with the JVM. It
provides an interface to the Store by allowing read and write operations on pages.
It hides all the networking code that actually talks to the Virtual Disk Cache.
When the Proxy module is initialized, it scans a file called virtual_disk_proxy.rc,
which would look similar to the following:

196.2.82.233 21915
196.2.82.247 21915

These are just IP address and port pairs.

2.1.2) Cache hierarchy

Current web cache systems define a hierarchy of data caches in which data access
proceeds as follows.

A client sends a request to a cache, and if the cache contains the requested
data, it replies with the data. Otherwise, the cache may ask its neighbors for
the data; if none of the neighbors possesses the data, the cache sends the
request to its parent. This process continues recursively up the hierarchy until
the data is located or the root cache fetches the data from the server specified
in the request. The caches then send the data down the hierarchy to the client,
and each cache along the path caches the data. To reduce wide-area network
bandwidth demand and to reduce the load on Internet information servers, caches
resolve misses through other caches higher in the hierarchy, as illustrated in
Figure 2.

Each cache in the hierarchy independently decides whether to fetch the
reference from the object's home site or from its parent or sibling caches, using a
simple resolution protocol.

Figure 2 : Cache hierarchy
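
The lookup order described in this section (local cache, then neighbors, then parent, then the origin server at the root) can be sketched as below. This is a simplified in-memory model under stated assumptions: the class name Cache, the dictionary standing in for the origin server, and the direct inspection of a neighbor's store are all illustrative and do not describe a real resolution protocol.

```python
class Cache:
    """Illustrative cache node in a hierarchy."""

    def __init__(self, name, parent=None, origin=None):
        self.name = name
        self.parent = parent
        self.neighbors = []
        # Stand-in for the origin server; only consulted by the root cache.
        self.origin = origin if origin is not None else {}
        self.store = {}

    def get(self, key):
        # 1. Local hit: reply with the data.
        if key in self.store:
            return self.store[key]
        # 2. Ask the neighbors.
        data = None
        for neighbor in self.neighbors:
            if key in neighbor.store:
                data = neighbor.store[key]
                break
        # 3. Otherwise go up to the parent, or to the origin at the root.
        if data is None:
            if self.parent is not None:
                data = self.parent.get(key)
            else:
                data = self.origin[key]
        # Each cache along the path keeps a copy.
        self.store[key] = data
        return data
```

After one child has fetched an object through the root, a sibling listed as its neighbor can obtain the same object without contacting the parent at all.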

Chapter 3

Proxy Server

3.1) What is proxy server ?

When working on the Internet, you receive many documents from different parts of
the globe. Sometimes it takes a long time to transmit a document, because the
server you need is remote or overloaded, or because Internet channels are
overloaded.

And what if another user had received this document only 10 minutes ago, and
you could have taken advantage of this fact if you knew how?

That is why we have a proxy server. What is this server like? It is a special
program, launched on one of our computers, for which we have allotted sizeable
disk space.

You can request a certain document from this server. If somebody has
already made the same request, you will get the document at once, at the full
speed your modem is capable of.

Even if our proxy server does not have the document, it will request it from
the remote WWW server storing the original, give you the document, and save a
copy in its disk space. This is not so bad either, because next time someone
else will get the document without occupying Internet channels, which will
speed up your own work as well.

Theory and practice show that a fair share of requests target a rather small
set of documents. The proxy server resolves about 50% of all queries directly
from its own disk.

Besides, a WWW browser (Netscape Navigator or MS Internet Explorer) can
already cache documents. However, you usually do not devote much disk space to
that. Even if you assigned a whole 1 GB disk to this purpose, you would gain
more by sharing our 24 GB cache space with thousands of other people.

What is more, our server is connected to similar servers of other Moscow
Internet providers. So the document you need might be found in their cache,
which again saves you time. Thus, the number of users of this huge cache space
keeps increasing.

The proxy server is available only to Zenon/Internet network subscribers
using dial-up contracts.

Please note that not all WWW pages work properly via a proxy. This is true
first of all of pages with Java applets. To keep the proxy server configured in
your browser, use its exceptions list for such pages.

3.2) The Virtual Disk Server

The Virtual Disk Server is implemented in Java and runs as a separate Java
application. The Virtual Disk Server runs two threads. The first, called
ProducerThread, listens on a network socket for Datagram packets, puts them on
a queue and calls notify() on that queue. In this manner all requests are given
an order: packets that arrive first get serviced first. The other thread,
ConsumerThread, does a wait() on the queue that ProducerThread is filling. When
it is woken up, it checks whether the queue is empty; if it finds a packet on the
queue, it pulls it off and examines it. A read request is dispatched to
processRead and a write request to processWrite.
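
The wait/notify pattern described here can be sketched in Python with threading.Condition, which offers the same wait() and notify() operations. This is not the server's Java code: the names RequestQueue, consumer, process_read, process_write and run_demo are hypothetical, packets are modelled as simple (kind, page) tuples, and the producer is the calling thread rather than a thread reading from a socket.

```python
import threading
from collections import deque


class RequestQueue:
    """FIFO queue guarded by a condition variable (wait/notify)."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def put(self, packet):
        with self._cond:
            self._items.append(packet)
            self._cond.notify()          # wake the consumer

    def take(self):
        with self._cond:
            while not self._items:       # re-check after every wake-up
                self._cond.wait()
            return self._items.popleft()


def process_read(packet):
    return ("data", packet[1])           # placeholder reply for a read


def process_write(packet):
    return ("ack", packet[1])            # placeholder reply for a write


def consumer(queue, replies):
    while True:
        packet = queue.take()
        if packet is None:               # sentinel: stop the thread
            return
        if packet[0] == "read":
            replies.append(process_read(packet))
        elif packet[0] == "write":
            replies.append(process_write(packet))


def run_demo(packets):
    queue, replies = RequestQueue(), []
    worker = threading.Thread(target=consumer, args=(queue, replies))
    worker.start()
    for packet in packets:               # the caller plays the producer role
        queue.put(packet)
    queue.put(None)
    worker.join()
    return replies
```

Because a single consumer removes packets from the front of the queue, replies are produced in arrival order.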

Both these methods need to find out where the physical data is for a given
page. This is done by consulting the IndirectionMap.

The IndirectionMap simply handles a file that stores (address, allocation
entry) tuples. It stores them sorted by address. It uses an insertion sort (and
shuffling of the map) to insert new items.

An allocation entry represents the offset in the DataMap. The DataMap is
simply a file full of 4KB pages.
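
An in-memory sketch of such a sorted map is shown below. It is an illustration under assumptions, not the server's implementation: the real IndirectionMap is file-backed, the method names insert and lookup are invented for the example, a binary search (Python's bisect module) is used to find the insertion point, and the allocation entry is treated as a byte offset into the DataMap.

```python
import bisect

PAGE_SIZE = 4096  # the DataMap holds 4KB pages


class IndirectionMap:
    """Keeps (address, allocation entry) pairs sorted by address."""

    def __init__(self):
        self._addresses = []  # sorted list of addresses
        self._entries = []    # allocation entries, parallel to _addresses

    def insert(self, address, allocation_entry):
        i = bisect.bisect_left(self._addresses, address)
        if i < len(self._addresses) and self._addresses[i] == address:
            self._entries[i] = allocation_entry  # update existing mapping
        else:
            # Inserting shifts the later items, like one insertion-sort step.
            self._addresses.insert(i, address)
            self._entries.insert(i, allocation_entry)

    def lookup(self, address):
        i = bisect.bisect_left(self._addresses, address)
        if i < len(self._addresses) and self._addresses[i] == address:
            return self._entries[i]
        return None


def read_page(data_map, offset):
    """Return the page stored at the given byte offset (an assumption)."""
    return data_map[offset:offset + PAGE_SIZE]
```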

Once a read or write request has found the location of the data, it does its
respective business and then sends a reply to the client. This can be an
acknowledgement of a write, data for a read or a negative acknowledgement for a
failed read operation.

Chapter 4

Java Virtual Machine (JVM)

This work uses a persistent Java Virtual Machine called PLaVa. In its standard
form, a JVM is an interpreter for bytecode. Bytecode is a highly optimized set of
instructions designed to be executed by the Java run-time system, which is called
the Java Virtual Machine (JVM).

Translating a Java program into bytecode makes it much easier to run the
program in a wide variety of environments. The reason is straightforward:
only the JVM needs to be implemented for each platform. Once the run-time package
exists for a given system, any Java program can run on it. Although the details of
the JVM differ from platform to platform, all JVMs interpret the same Java
bytecode. If a Java program were compiled to native code, then different
versions of the same program would have to exist for each type of CPU
connected to the Internet. The execution of every program is under the control of
the JVM. The JVM can contain the program and prevent it from generating side
effects outside of the system.

Chapter 5

CDN (Content Distributed Network)

CDN simulations are known to be highly memory-intensive. A memory-efficient
data structure stores cache state for a small subset of popular objects
accurately and uses approximations for storing the state of the remaining
objects. Since popular objects receive a large fraction of requests while less
frequently accessed objects consume much of the memory space, this approach
yields large memory savings and reduced errors.

“A content distribution network (CDN) is a collection of proxies that acts
as intermediaries between the origin server and the end clients”. Proxies in a
CDN cache frequently accessed data from the origin server and serve requests for
these objects from the proxy closest to the end-user. Proxies can also prefetch,
transform, and process content before delivering it to the end-user. By providing
these services, “a CDN has the potential to reduce the load on origin servers and
the network and also improve client response times.”

In general, simulation of a large CDN is highly memory- and compute-intensive.
The memory-intensive nature arises from the need to simulate a disk cache at
each proxy: the larger the number of objects in the cache and the larger the
number of proxies in the CDN, the greater the memory requirements for simulating
the CDN. CDNs may consist of hundreds or thousands of proxies [1,14].

For example:
Consider a content distribution network with 1000 proxies. Assume that
each proxy processes a million requests per day from its end-users and that
500,000 of these requests are for unique objects (the remaining requests are
assumed to access objects already in the proxy cache). In such a scenario, a
day-long simulation will require the CDN to process a total of 1 billion
requests and maintain cache state for 500 million unique objects (0.5 million
objects per proxy cache). A simulated cache needs to store several pieces of
information as part of the object-specific state; the state includes the object
ID, the size of the object, its last modification time, the type of the object,
etc. Thus, if we conservatively assume that 20 bytes are required to maintain
the cache state of each object, then the total memory requirement for 500
million objects is 10 GB. This is beyond the memory capacity of typical compute
servers (except for very high-end servers that are available today). Similarly,
the computational requirements for processing a billion requests can overwhelm
most servers, necessitating long-running simulations.
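
The arithmetic in this example can be checked with a few lines of code. The function name cdn_simulation_footprint is made up for the illustration, and 1 GB is taken to mean 10^9 bytes.

```python
def cdn_simulation_footprint(proxies, requests_per_proxy,
                             unique_per_proxy, bytes_per_object):
    """Return (total requests, total unique objects, total state in bytes)."""
    total_requests = proxies * requests_per_proxy
    total_objects = proxies * unique_per_proxy
    total_bytes = total_objects * bytes_per_object
    return total_requests, total_objects, total_bytes


# Figures from the example: 1000 proxies, one million requests per proxy per
# day, 500,000 unique objects per proxy, 20 bytes of state per object.
requests, objects, memory = cdn_simulation_footprint(
    1000, 1_000_000, 500_000, 20)
memory_gb = memory / 10**9  # decimal gigabytes (assumption)
```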

Figure 3 : A typical Content Distributed network

Consider a content distribution network with N proxies, each of which
acts as an intermediary between servers and the end-users, as shown in Figure 3.
We assume that each user sends requests for web content to a proxy in the
CDN. Each proxy is assumed to maintain a cache of frequently accessed content;
this cache is typically stored on a disk. Upon receiving a request, the proxy
services the request from the local cache (in the event of a cache hit) or by
fetching the requested object from another proxy or the origin server (in the
event of a cache miss). Objects fetched upon a cache miss are inserted into the
cache (assuming they are cacheable) for servicing future requests. The specific
details of
(1) how to service a cache miss (i.e., the policy that determines
whether to fetch the object from another proxy or the server) and
(2) the meta-data required at the proxy to make such decisions
are CDN-dependent. Similarly, issues such as the organization of the CDN into a
hierarchy or proxy groups, the degree of co-operation among proxies to service
user requests, and the policies used to determine a suitable proxy to serve a
particular end-user are also CDN-specific.

One way to approximate the cache state is to use approximate data structures.
For instance, a simulator could use a Bloom filter to determine if an object is
present in the cache. “The bloom filter is a Boolean lookup function that can
determine if an object is present in a set with a high probability” [1,3].

Consider a simulated proxy cache that is designed to hold up to C distinct
objects. Let us assume that state information for T popular objects is
maintained accurately and that state information for the remaining C-T objects
is stored approximately. We assume that the T popular objects are ordered in the
cache by the cache replacement policy (such as LRU) [1,3].

Figure 4: A Bloom filter with 4 hash functions and an m-bit vector (the hash
values h1(a), h2(a), h3(a) and h4(a) select positions in the m-bit vector)

A Bloom filter is essentially a lookup function implemented using hash functions,
and can be used to insert, delete and look up elements. As shown in Figure 4, a
Bloom filter consists of an m-bit vector and a set of hash functions
H = {h1, h2, ..., hk} (here k = 4). Initially all bits are set to zero. To insert
an element a, the bits at positions h1(a), h2(a), ..., hk(a) are set to 1. A
deletion requires these bit positions to be reset to zero. A lookup involves
examining these bit positions; an element is said to be present in the cache if
all of these bits are set. Because different elements can share bit positions, a
lookup may report an element that was never inserted, and resetting bits on
deletion may also clear bits that belong to other elements.
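
The operations just described can be sketched as follows. This is a minimal illustration, not the simulator's code: the class name BloomFilter and its method names are assumptions, and the k hash functions are derived from SHA-256 with a different one-byte prefix each, which is merely one convenient way to obtain several hash functions.

```python
import hashlib


class BloomFilter:
    """m-bit vector with k hash functions, as in the description above."""

    def __init__(self, m, k=4):
        self.m = m
        self.k = k
        self.bits = [0] * m  # initially all bits are zero

    def _positions(self, item):
        """Yield h1(item), ..., hk(item) as positions in the bit vector."""
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def lookup(self, item):
        # Present only if every selected bit is set; may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

    def delete(self, item):
        # Resets the bits as described; this can also affect other elements
        # that happen to share one of these positions.
        for pos in self._positions(item):
            self.bits[pos] = 0
```

The memory cost is m bits per simulated cache regardless of how much per-object state an exact representation would need, which is the source of the savings.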

Chapter 6

Store Proxy

6.1) What is store ?

The original PLaVa implementation used a very simple store mechanism. The
store itself consisted of a file on the local file system. The store was accessed
from the JVM through conventional I/O calls provided by the operating system.

An important part of the Store Proxy is the layout of objects and their
metadata on the store. We reserve the region from 4TB to 5TB for object entries.
Object entries are 512 bytes each. They contain the length of an object and the
indices of the data blocks storing that object. The data blocks are stored in the
5TB to 7TB region. Each data block is 4K in size. When accessing the persistent
store it is necessary to find objects based on their Persistent Identifier (PID). The
PLaVa JVM expects PIDs to be negative so that it can distinguish them from real
addresses for objects. To generate a PID for an object we simply negate the offset
of the object entry. Thus the object entry at offset 1 would be represented by the
PID -1.
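
The PID scheme can be illustrated with the short sketch below. Several points are assumptions made only for the example: 1 TB is taken as 2^40 bytes, the offset is read as an index counting object entries from the start of the 4TB region, and the helper names (pid_from_offset, offset_from_pid, is_pid, entry_address) are invented here.

```python
TB = 2 ** 40                      # assumption: binary terabytes
OBJECT_ENTRY_BASE = 4 * TB        # object entries live in the 4TB-5TB region
OBJECT_ENTRY_END = 5 * TB
OBJECT_ENTRY_SIZE = 512           # bytes per object entry
DATA_BLOCK_BASE = 5 * TB          # data blocks live in the 5TB-7TB region
DATA_BLOCK_SIZE = 4096            # bytes per data block


def is_pid(value):
    """PIDs are negative, which distinguishes them from real addresses."""
    return value < 0


def pid_from_offset(offset):
    """Negate the offset of the object entry to obtain its PID."""
    if offset <= 0:
        raise ValueError("offset must be positive so that the PID is negative")
    return -offset


def offset_from_pid(pid):
    if not is_pid(pid):
        raise ValueError("not a PID")
    return -pid


def entry_address(pid):
    """Store address of the object entry, if offsets count entries (assumed)."""
    address = OBJECT_ENTRY_BASE + offset_from_pid(pid) * OBJECT_ENTRY_SIZE
    if address >= OBJECT_ENTRY_END:
        raise ValueError("object entry lies outside the 4TB-5TB region")
    return address
```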

6.2) Original Store Layout

The store consists of a single file that stores all objects and data about objects.
Both normal objects and class objects are treated identically by the store. There
are no object categories as can be found in other persistent stores (for example
large objects, small objects, objects with different internal layouts). Each object is
stored directly as an array of bytes in the file. The offset of this array in the file
determines the object's Persistent IDentifier (PID). A PID is a means of
identifying an object that has moved to persistent storage. Because the object
identity is directly related to its position in the file, objects cannot be moved
without great difficulty. New objects are simply appended to the store file. This
store layout has the advantages of only having to deal with one file (there are
no problems with keeping separate files consistent) and ease of adding and
accessing objects. Disadvantages of this layout are the lack of garbage
collection and fragmentation. Without garbage collection the store may grow
uncontrollably. Because the store can grow so fast, the objects being accessed
may become severely fragmented. This may happen, for example, when the root
object (at the beginning of the store file) and new objects (at the end of the
store file) are accessed alternately. The resulting overhead of seeking back and
forth presents a performance problem.

Chapter 7

Conclusion

CDN simulation is highly memory-intensive. The memory-efficient data
structure stores cache state for a small subset of popular objects accurately
and uses approximations for storing the state of the remaining objects.

The design intention was based on the following observation:

Popular objects receive a large fraction of requests, while less frequently
accessed objects consume much of the memory space.

“Maintaining the accurate state for objects accessed by a large fraction of
the requests and approximate state for the remaining objects consuming much of
the space yields large memory savings and reduced errors (5-10 % of errors).”

BIBLIOGRAPHY

[1] Purushottam Kulkarni, Prashant Shenoy and Weibo Gong, “Scalable
Techniques for Memory-efficient CDN Simulations”, Computer Science
Department and ECE Department, University of Massachusetts, Amherst, MA 01003.

APPENDIX

Power Point Slides

A) Outline
B) Introduction
C) Virtual disk proxy
D) Virtual disk cache
E) Proxy server
F) Java virtual machine
G) CDN
H) Store proxy
