Distributed Hash Tables (DHTs)

An important component of many P2P applications, and of many other distributed applications, is an index (that is, a simple database) supporting search and update operations. When this database is distributed, the peers may carry out content caching and sophisticated routing of queries among themselves. Since information indexing and searching is such a critical component of such systems, we'll now cover one popular indexing and searching technique: Distributed Hash Tables (DHTs).

Let's therefore consider building a simple distributed database over a large number (perhaps millions) of peers that supports simple indexing and querying. The information stored in our database will consist of (key, value) pairs. For instance, the keys could be social security numbers and the values could be the corresponding human names; in this case, an example key-value pair is (156-45-7081, Johnny Wu). Or the keys could be content names (e.g., names of movies, albums, and software), and the values could be IP addresses at which the content is stored; in this case, an example key-value pair is (Led Zeppelin IV, 203.17.123.38). Peers query our database by providing the key: If there are (key, value) pairs in the database that match the key, the database returns the matching pairs to the querying peer. So, for instance, if the database stores social security numbers and their corresponding human names, a peer can query a particular social security number, and the database returns the name of the person who has that social security number. Peers will also be able to insert (key, value) pairs into our database.
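
To make the query and insert semantics concrete, here is a minimal, non-distributed sketch of such an index in Python. The class and method names (SimpleIndex, insert, query) are illustrative choices, not part of any particular system:

class SimpleIndex:
    """A plain in-memory (key, value) index supporting insert and query."""
    def __init__(self):
        self._table = {}                      # key -> list of matching values

    def insert(self, key, value):
        self._table.setdefault(key, []).append(value)

    def query(self, key):
        return self._table.get(key, [])       # all matching values, or [] if none

index = SimpleIndex()
index.insert("156-45-7081", "Johnny Wu")
index.insert("Led Zeppelin IV", "203.17.123.38")
print(index.query("156-45-7081"))             # ['Johnny Wu']

The challenge taken up in the rest of this section is providing the same two operations when the pairs are spread over millions of peers rather than held in one table.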

Building such a database is straightforward with a client-server architecture in which all the (key, value) pairs are stored in one central server. This centralized approach was also taken in early P2P systems such as Napster. But the problem is considerably more challenging and interesting in a distributed system consisting of millions of connected peers with no central authority. In a P2P system, we want to distribute the (key, value) pairs across all the peers, so that each peer only holds a small subset of the totality of the (key, value) pairs. One simple approach to building such a P2P database is to (1) randomly scatter the (key, value) pairs across the peers and (2) have each peer maintain a list of the IP addresses of all participating peers. In this way, the querying peer can send its query to all other peers, and the peers containing (key, value) pairs that match the key can respond with their matching pairs. Such an approach is completely unscalable, of course, as it would require each peer to track all other peers (perhaps millions) and, even worse, have each query sent to all peers.
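
The rough sketch below illustrates why this "query everyone" approach breaks down: every query generates one message per peer, so traffic grows linearly with the size of the system. The Peer class and the addresses used here are purely illustrative:

class Peer:
    def __init__(self, address):
        self.address = address
        self.pairs = {}          # this peer's random share of the (key, value) pairs

def flood_query(all_peers, key):
    """Query every peer in the full membership list: one message per peer, O(N) total."""
    return [(peer.address, peer.pairs[key]) for peer in all_peers if key in peer.pairs]

peers = [Peer(f"10.0.0.{i}") for i in range(1, 6)]
peers[3].pairs["Led Zeppelin IV"] = "203.17.123.38"
print(flood_query(peers, "Led Zeppelin IV"))     # [('10.0.0.4', '203.17.123.38')]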

We now describe an elegant approach to designing such a P2P database. To this end, let's first assign an identifier to each peer, where each identifier is an integer in the range [0, 2^n - 1] for some fixed n. Note that each such identifier can be expressed by an n-bit representation. Let's also require each key to be an integer in the same range. The observant reader may have noticed that the example keys described a little earlier (social security numbers and content names) are not integers. To turn these keys into integers, we will use a hash function that maps each key (e.g., a social security number) to an integer in the range [0, 2^n - 1]. A hash function is a many-to-one function: two different inputs can have the same output (the same integer), but the likelihood of two inputs having the same output is extremely small. (Readers who are unfamiliar with hash functions may want to visit "Multimedia Networking", in which hash functions are discussed in some detail.) The hash function is assumed to be publicly available to all peers in the system. Hereafter, when we refer to the "key", we are referring to the hash of the original key. So, for instance, if the original key is "Led Zeppelin IV", the key will be the integer that equals the hash of "Led Zeppelin IV". Also, since we are using hashes of keys, rather than the keys themselves, we will henceforth refer to the distributed database as a Distributed Hash Table (DHT).
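
As a concrete illustration, the sketch below maps an original key to an n-bit integer. The text only assumes some publicly known hash function shared by all peers; the use of SHA-1 (via Python's hashlib) and the tiny value n = 4 are assumptions made here just for illustration:

import hashlib

N_BITS = 4    # the 'n' from the text; deployed DHTs use far larger values (e.g., 128 or 160)

def key_to_id(original_key, n_bits=N_BITS):
    """Map an arbitrary original key (a string) to an integer in [0, 2^n - 1]."""
    digest = hashlib.sha1(original_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** n_bits)

print(key_to_id("Led Zeppelin IV"))    # some integer between 0 and 15
print(key_to_id("156-45-7081"))        # likewise

Because the n-bit range is vastly smaller than the space of possible original keys, the mapping is necessarily many-to-one, which is exactly the property described above.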

Let's now look at the problem of storing the (key, value) pairs in the DHT. The central issue here is defining a rule for assigning keys to peers. Given that each peer has an integer identifier and that each key is an integer in the same range, a natural approach is to assign each (key, value) pair to the peer whose identifier is closest to the key. To implement such a scheme, we'll need to define what is meant by "closest", for which many conventions are possible. For convenience, let's define the closest peer as the immediate successor of the key. To gain some insight, let's look at a specific example. Suppose n = 4, so that all the peer and key identifiers are in the range [0, 15]. Further suppose that there are eight peers in the system with identifiers 1, 3, 4, 5, 8, 10, 12, and 15. Finally, suppose we want to store the (key, value) pair (11, Johnny Wu) in one of the eight peers. But in which peer? Using our convention for closest, since peer 12 is the immediate successor of key 11, we store the pair (11, Johnny Wu) in peer 12. [To complete our definition of closest, if the key is exactly equal to one of the peer identifiers, we store the (key, value) pair in that matching peer; and if the key is larger than all the peer identifiers, we use a modulo-2^n convention, storing the (key, value) pair in the peer with the smallest identifier.]
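
The successor rule, including the wrap-around case, can be captured in a few lines. This is only an illustrative sketch: it assumes the complete list of peer identifiers is known, which, as discussed next, is not realistic in a large DHT:

def closest_peer(key, peer_ids):
    """Return the peer responsible for `key`: the key's immediate successor,
    wrapping around to the smallest identifier when the key exceeds them all."""
    at_or_above = [p for p in peer_ids if p >= key]
    return min(at_or_above) if at_or_above else min(peer_ids)

peers = [1, 3, 4, 5, 8, 10, 12, 15]
print(closest_peer(11, peers))                      # 12, as in the example above
print(closest_peer(5, peers))                       # 5 (key equals a peer identifier)
print(closest_peer(14, [1, 3, 4, 5, 8, 10, 12]))    # 1: wraps around modulo 2^n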

Now assume a peer, Alice, wants to insert a (key, value) pair into the DHT. Conceptually, this is simple: She first determines the peer whose identifier is closest to the key; she then sends a message to that peer, instructing it to store the (key, value) pair. But how does Alice determine the peer that is closest to the key? If Alice were to keep track of all the peers in the system (peer IDs and corresponding IP addresses), she could locally determine the closest peer. But such an approach requires each peer to keep track of all other peers in the DHT, which is completely impractical for a large-scale system with millions of peers.
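
Purely for illustration, here is what Alice's conceptual insert step would look like if she did have complete knowledge of the peers. peer_directory and send_store are hypothetical placeholders, not part of any real DHT protocol, and the addresses are made up:

peer_directory = {1: "10.0.0.1", 3: "10.0.0.3", 12: "10.0.0.12", 15: "10.0.0.15"}   # id -> IP

def send_store(ip_address, key, value):
    print(f"STORE ({key}, {value!r}) -> {ip_address}")   # stand-in for a real network message

def insert(key, value):
    """Find the key's immediate successor in the (assumed-global) directory and send it the pair."""
    at_or_above = [p for p in peer_directory if p >= key]
    responsible = min(at_or_above) if at_or_above else min(peer_directory)
    send_store(peer_directory[responsible], key, value)

insert(11, "Johnny Wu")    # sent to peer 12: STORE (11, 'Johnny Wu') -> 10.0.0.12

The interesting question, taken up next, is how to locate the responsible peer without any peer having to maintain this kind of global directory.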



