A Survey on Distributed Hash Tables and Distributed File Systems
This article will define client-server network architecture and peer-to-peer network architecture, and explain how many applications use distributed hash tables (DHTs) in their implementations of distributed file systems (DFSs). Although this blog is technical, all terms will be explained so that readers with no networking, computer science, or systems background can understand and appreciate the critical design decisions that are made and why these systems are so powerful.
Before diving into DHTs and DFSs, we must lay the groundwork to understand how simple networked systems work and traditional hash tables. If you are familiar with traditional networks and how client-server models generally work, feel free to skip the background information.
Figure 1: Client-Server Model src: https://thehostinginfo.com/
The simplest networked system consists of a client (a computer, cell phone, smart device, laptop, etc.) connected to a server via a network (typically the Internet, but other types of networks exist). Clients can connect to the server through a wired connection (an Ethernet cable) or a wireless one (WiFi, cellular data, etc.).
Whenever a client wants to get information from or send information to a webpage, it is actually making requests to a server, which can be thought of as a highly specialized computer built to handle many requests. Every device on a network has a unique identifier called an IP address; when the client makes a request, that request is routed to the server's IP address, and the response is routed back to the client's. This exchange between IP addresses happens quickly (as a rough estimate, think milliseconds).
For example, when you issue a search on Youtube.com for a video, that request gets routed to an IP address that points to Google's servers, and you get back a response listing the matching videos. This back-and-forth communication between a client and a server underlies every website on the web today. We now discuss the benefits and drawbacks of this centralized system as a segue into DHTs and DFSs as an alternative.
This section was originally going to highlight the pros and cons of the client-server model; however, each benefit can also be argued as a weakness of the system. We will see both sides of the client-server model below.
As an aside: many of these pain points are less severe today, since much of the industry has moved to relatively affordable cloud computing platforms like Amazon Web Services, Microsoft Azure, and Google Cloud Platform; however, the design still has interesting consequences. These services allow websites to host all their infrastructure in the providers' data centers and bill clients for their usage (Infrastructure as a Service, or IaaS). This model has revolutionized computing and is a rapidly growing market.
Centralization of resources and control

Benefits:

- Centralization: easy maintenance, updates, and control of resources
- Access and integrity of data are controlled by a dedicated server
- Scalability: if the server needs more bandwidth, network engineers can build and install more servers for the site

Drawbacks:

- Traffic congestion: since a single localized server handles all the requests, it can become a bottleneck and introduce significant delay for future requests.
- Centralization requires trusting network engineers not to be bad agents and compromise sensitive information.
- Availability: if the server goes down, the whole site goes down, and server failures are fairly common.
- Scalability is expensive, since one must purchase additional servers for a surge in demand that may not be permanent.
Key-Value Pairs to hash function to Hash Table (src: Data Structure – Hash Table)
A hash table is composed of a hash function and an array of slots. A hash function takes in a value and outputs a key that looks random. The function is deterministic (it always returns the same key for the same value), and similar values do not map to similar keys. When different values map to the same key, a collision occurs, and another spot must be found to store the value (chaining and probing are two common ways to handle collisions).
With either linear probing or chaining, an element must then be found by linearly scanning the chain or the neighboring slots in the table. In the worst case, we would have to search the entire table (which is entirely undesirable).
To query an element, we compute its hash and look it up in the table. Effective hash tables use hash functions that distribute elements evenly to reduce collisions, since a collided value is stored in a different location and must still be searched for.
Therefore, whenever implementing a traditional/standard hash table, we must store the entire table in memory and compute the hash for each element that is entered and queried. We want to ensure that we minimize the probability of collisions from our hash functions. There are many approaches to achieve this result.
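To make these mechanics concrete, here is a toy hash table with separate chaining; the bucket count and the use of Python's built-in hash() are illustrative choices, not a prescription:

```python
# A toy hash table with separate chaining. The table size (8) and the
# hash function are illustrative choices for this sketch.

class ChainedHashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # hash() is deterministic within one run; the modulo maps it
        # onto our fixed number of buckets.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # collisions just extend the chain

    def get(self, key):
        # Worst case we scan a whole chain: with a bad hash function,
        # every key lands in one bucket and lookup degrades to O(n).
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None
```

With only two buckets and four keys, collisions are guaranteed, yet lookups still succeed because the chains are scanned linearly.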
As data has scaled exponentially, storing all of this information in a single hash table on one machine has become too expensive to be feasible. This has led to new designs that work within these resource limits.
Distributed Hash Tables (src: Wikipedia)
Distributed Hash Tables
Distributed hash tables (DHTs) work in a similar manner to hash tables; however, they are optimized for large amounts of data that exist in files spread across many clients, also known as distributed file systems. We will now explore the key features that make a DHT an ideal candidate for large decentralized systems.
Autonomy and Decentralization
A key difference between traditional hash tables and DHTs lies in how the data is stored. Instead of the table living in one location, it is spread across a network of clients/nodes. Furthermore, this network of clients is self-sustaining: it is not controlled by any single entity. The implementation details are outlined below, and later we will see why these distributed hash tables are great for distributed file systems.
Imagine you had a group of friends and you wanted to pool your computational resources to store some data or work on a computationally intensive problem. Your friends will likely not have their machines running 100% of the time, so their resources may be interrupted or go offline periodically. This introduces the concept of churn: nodes constantly entering and leaving the network. Designers of DHTs have to make the system reliable even when individual nodes are not. If there are n nodes in the network, a DHT commonly requires each node to keep track of only about log n other nodes, and lookups complete in about log n hops. Since log n grows slowly as n gets large, the bookkeeping needed to cope with churn stays small even in huge networks. Later, we will explain the hashing scheme that makes DHTs fault tolerant.
Since DHTs were designed to handle problems of scale by pooling resources distributed across the Internet, they must work efficiently even when there are thousands or millions of nodes in the network. DHTs are distributed structures and must also deal with other issues that arise in distributed systems, such as load balancing, data integrity, and ensuring that operations like routing, data storage, and retrieval are performed efficiently. These distributed systems topics are beyond the scope of this post.
A Distributed Hash Table (src: Distributed Hash Table)
Architecture and Implementation of DHTs (src: Distributed Hash Tables: Architecture and Implementation)
Let’s understand what operations need to be supported in a hash table and see how implementations of DHTs differ for distributed applications. The data structures underlying a DHT are an abstract keyspace (an abstract concept of possible keys that data can map to), a keyspace partitioning scheme (how to separate the keys amongst the nodes in the network), and an overlay network (a structure that connects the nodes and supports the retrieval of an owner of a key in the keyspace).
The hash table API must support three methods: for some value v, we must be able to put the value into the table with put(v), retrieve it with get(v), and delete it with remove(v).
Our previous hash function maps each value to a single static key; however, this will not work for DHTs, since the keyspace is spread across nodes that are constantly entering and leaving the network.
New Hashing Options
Consistent Hashing: Consider a distance function f(x1, x2) that defines the distance between keys x1 and x2. Each node on the network is assigned an identifier xi, and the node with ID xi stores every key xj for which f(xi, xj) is minimized over all node IDs. In the graphic below, we see an implementation in the Chord protocol that treats nodes as points on a circle: each key is stored on its successor, the first node encountered moving clockwise from the key's position on the circle.
When a node enters or exits the network, the work that needs to be done is minimal. Only the keys lying between the affected node and its predecessor on the circle must be reassigned to a new owner; all other nodes keep their existing keys and are not affected.
There are many other hashing schemes, like rendezvous hashing and locality-preserving hashing (left to the reader); however, the invariant among them is that minimal work is needed to keep keys consistent as nodes constantly enter and exit the network.
As discussed above, to put an element into the DHT, its hash is calculated and the key-value pair is stored on the key's successor: the node closest to the hash in the clockwise direction.
To get a value, hash its key and contact the key's successor node. If the node we contact does not own the key, it forwards the query to the closest successor it knows of; this process repeats until the node responsible for the value is found.
To remove some element from the DHT, we get(v) and then remove it from the table stored in that particular node.
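The put/get/remove operations above can be sketched with a minimal consistent-hash ring in the spirit of Chord; the node names, the 32-bit keyspace, and the single-step routing are simplifications for illustration, not part of any real protocol:

```python
import hashlib
from bisect import bisect_right

# A minimal consistent-hash ring in the spirit of Chord. Node names and
# the 32-bit keyspace are illustrative choices for this sketch.
KEYSPACE = 2 ** 32

def h(s: str) -> int:
    """Deterministically map a string onto a point on the ring."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big") % KEYSPACE

class Ring:
    def __init__(self, node_names):
        # Each node sits at the point h(name) on the circle.
        self.points = sorted((h(n), n) for n in node_names)

    def owner(self, key: str) -> str:
        """The key's successor: first node clockwise from h(key), wrapping."""
        ps = [p for p, _ in self.points]
        i = bisect_right(ps, h(key)) % len(self.points)
        return self.points[i][1]

class DHT:
    """put/get/remove simply route to the owning node's local table."""
    def __init__(self, ring):
        self.ring = ring
        self.store = {name: {} for _, name in ring.points}

    def put(self, key, value):
        self.store[self.ring.owner(key)][key] = value

    def get(self, key):
        return self.store[self.ring.owner(key)].get(key)

    def remove(self, key):
        self.store[self.ring.owner(key)].pop(key, None)

# When a node joins, only the keys whose successor becomes the new node
# change owners; every other key stays where it was.
keys = ["cat.mp4", "dog.mp4", "fish.mp4", "song.mp3"]
before = {k: Ring(["node-a", "node-b", "node-c"]).owner(k) for k in keys}
after = {k: Ring(["node-a", "node-b", "node-c", "node-d"]).owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Note that a real Chord node would forward lookups hop by hop through a finger table rather than compute the owner locally; the sketch collapses that routing into one step.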
Distributed File System Applications
Now for the fun part: How do we apply these things and where do they exist in the wild?
Peer-to-peer (P2P) systems are computing networks that share the workload amongst connected devices. Our focus is on distributed file systems for file sharing platforms, which implement storage systems distributed amongst the nodes in the network. Popular P2P storage systems like Napster, BitTorrent, and Gnutella were some of the earliest file sharing networks. While earlier models required a centralized server, modern P2P systems do not rely on one and can provide user anonymity.
When P2P networks are large, these distributed file systems have many benefits: low cost, low latency, reliable performance, anonymity, and decentralization. Costs are amortized over the number of nodes in the network. In a centralized model, by contrast, the company managing the server must pay for each additional byte of data stored for each customer. (Note: companies have realized this and have moved toward distributed storage models.)
Implementation (src: Distributed File Systems (DFS))
This is a gross simplification, but to store a file in a distributed file system we must use a distributed data store. Because nodes are ephemeral, multiple copies of a file must exist on several different nodes. There is a tradeoff between replication and reliability/performance in distributed file systems: if every node on the network had a copy of every file transferred through the network, the network would never fail, but this is infeasible, since storing every file would consume most of each node's storage space. Deciding how many copies to store is an important consideration for DFS designers.
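The replication tradeoff can be made quantitative. If each node is independently online with probability p, a file with k replicas is unavailable only when all k replicas are offline, so its availability is 1 - (1 - p)^k. A quick sketch (the uptime figure is invented for illustration):

```python
# Availability of a file stored on k independently failing nodes,
# each online with probability p. p = 0.5 is an invented illustration
# of a flaky peer that is online half the time.
def availability(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

for k in (1, 3, 5, 10):
    print(f"{k} replicas -> availability {availability(0.5, k):.4f}")
```

Even with very unreliable peers, a handful of replicas drives availability close to 1, which is why modest replication factors go a long way.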
Initialization and Data Transfer
To initialize a DFS, the file system must be mounted onto a client's local file system. However, it will not behave the same way as the rest of the local filesystem, since its data blocks are stored amongst the peers in the network. Therefore, the application must provide a dedicated file system backed by a distributed hash table to look up the location of the data being accessed.
To store data in distributed storage systems, designers must decide what the best unit of storage for their application would be. Units of storage in a distributed file system can be full files, individual blocks (fixed-size data segments), or bytes. With file-sized transfers, the node storing the file is accessed once and the file is transferred to the destination node, whereas with block- or byte-sized transfers, every node that holds data belonging to the file must be accessed. The benefit of byte- or block-sized transfers is that nodes with little available storage can still participate in the distributed file system, and it is typically more reliable to transfer small units of data, since many of the nodes are unreliable: a large-file transfer from an unreliable node may not complete and must then be restarted from a different node. Moreover, research has shown that most files on a DFS are smaller than a block.
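To make block-sized storage concrete, here is a sketch that splits a byte string into fixed-size blocks and hashes each block to a peer. The block size, peer names, and modulo placement rule are invented for illustration; real systems use much larger blocks and more careful placement (e.g. consistent hashing or a metadata service):

```python
import hashlib

BLOCK_SIZE = 4  # absurdly small for illustration; real blocks are KBs to MBs

def split_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Chop a file into fixed-size blocks (the last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place(block: bytes, peers):
    """Hypothetical placement: hash the block and pick a peer by modulo."""
    digest = int.from_bytes(hashlib.sha1(block).digest()[:4], "big")
    return peers[digest % len(peers)]

data = b"hello distributed world!"
peers = ["peer-1", "peer-2", "peer-3"]
placement = [(blk, place(blk, peers)) for blk in split_blocks(data)]
```

Reassembling the file means fetching each block from its peer and concatenating them in order, which is why every involved node must be reachable for a block-level read.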
Implementing the operations
Let's assume that a user can read a file, write to it, and write it back for the rest of the network to use. With a single user on the DFS, this is relatively straightforward and behaves as expected; with multiple users, it is not.
For systems that use whole files as the unit of transfer, two users writing to two different parts of a file will produce inconsistencies: the final file reflects one user's changes but not both, because one write is destroyed by the subsequent write. In a block-sized system, both changes are reflected in the final file as long as each change was made in a different block; otherwise, we have the same problem as before, at the level of a single block.
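The lost-update problem is easy to simulate; the file contents and the two clients' edits below are invented for illustration:

```python
# Two clients each read the same file, edit different regions, and write
# back. With whole-file semantics, the second writer clobbers the first;
# with block-level writes, both edits survive.

original = ["AAAA", "BBBB"]  # a file modeled as two blocks (illustrative)

# Whole-file semantics: each client writes back its complete private copy.
client1 = list(original); client1[0] = "aaaa"   # client 1 edits block 0
client2 = list(original); client2[1] = "bbbb"   # client 2 edits block 1
server = client1           # client 1 saves first...
server = client2           # ...then client 2's save overwrites the whole file
whole_file_result = server # client 1's edit to block 0 is lost

# Block-level semantics: each client writes back only the blocks it touched.
server = list(original)
server[0] = "aaaa"         # client 1 writes block 0
server[1] = "bbbb"         # client 2 writes block 1
block_result = server      # both edits survive
```

The same conflict reappears at block granularity if both clients touch the same block, which is why finer units only shrink, not eliminate, the consistency problem.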
Most distributed file systems do not use byte-sized transfers, since the per-transfer network overhead would exceed the byte itself, more than doubling the amount of data sent.
Some implementations, like the Andrew File System (AFS) versions 1 and 2 and Coda, implemented whole-file semantics, while NFS and AFS version 3 implement block-level semantics. Most implementations use block-level semantics, since it captures the sweet spot of reliability, performance, and scalability for most applications. Once a file has been transferred to the local file system, it persists in a cache, so future accesses are fast.
Caching and Consistency
As a baseline, accessing data over the network is slower than accessing it locally. Ideally, we want to fetch a file once, make changes locally, and only then send it back through the network. This is achieved through caching: the client keeps a local copy of recently used file data in memory or on disk. For a single-user system, this just works: the user opens a file, makes changes, and saves them. When multiple users work on the same file at once, however, inconsistencies are bound to happen. One approach is for each client to validate the file in its cache: it checks whether the locally cached copy is still consistent with the server's using metadata such as a timestamp, checksum, access time, or another measure of file integrity. If the cached copy is stale, the client refetches the file from the server. This cache-and-validate method is used by NFS.
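A validate-on-open cache can be sketched in a few lines. Here, a version counter stands in for the timestamp or checksum; the Server/Client classes and file names are invented for illustration, not NFS's actual interfaces:

```python
# A sketch of cache-and-validate in the spirit of NFS: before trusting a
# cached copy, the client compares its cached version with the server's
# metadata and refetches only if the copy has gone stale.

class Server:
    def __init__(self):
        self.files = {}     # name -> (version, data)
        self.version = 0

    def write(self, name, data):
        self.version += 1   # a counter stands in for a modification time
        self.files[name] = (self.version, data)

    def stat(self, name):
        return self.files[name][0]   # cheap metadata-only check

    def read(self, name):
        return self.files[name]      # full (version, data) transfer

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}     # name -> (version, data)

    def open(self, name):
        # Validate the cached copy against the server's metadata.
        if name in self.cache and self.cache[name][0] == self.server.stat(name):
            return self.cache[name][1]           # cache hit, still fresh
        self.cache[name] = self.server.read(name)  # stale or missing: refetch
        return self.cache[name][1]
```

The stat call is far cheaper than a full read, but it still costs a network round trip per open, which is exactly the overhead the callback approach below avoids.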
Alternatively, the server can inform a node via a callback whenever a file changes on the server. This callback approach has been implemented by modern distributed file systems that want to avoid periodically checking whether a file has been modified.
Applications of Distributed File Systems in the Wild
BitTorrent is a distributed file system in which a distributed hash table is used to look up where information is stored in the network. A node downloading a file is considered a leecher, while nodes uploading data to other nodes are considered seeders. Both must be present for the network to be reliable and performant. Many think of BitTorrent as a medium for downloading copyrighted material, but it is actually a powerful tool for storing information using decentralized, pooled resources.
“The Hadoop File System (HDFS) is designed to support Hadoop, an open framework for a specialized type of very scalable distributed computing often known as the Map-Reduce paradigm. Both Hadoop and HDFS are based on Google’s papers describing early, presumably landmark, versions of their framework. These days, there are many champions and users of Hadoop. It is worth mentioning Yahoo!, by name, because of their particularly deep involvement and advocacy.” (src: Lecture on DFS).
While distributed file systems are a complex beast, they look like an ideal answer to the growing amount of data being produced and stored daily. They mitigate problems of scale and high latency: information no longer depends on the status of a single location; instead, it is distributed across the world in a way that makes the storage system more robust. The cost of storing information decreases significantly, since it is amortized over the number of people on the network, and as computers become more prevalent, the pool of available storage keeps growing. Furthermore, no single centralized entity controls all the information, which matters at a time when corporations have access to so much of our personal data stored in the "cloud".