Programmer Thoughts

By John Dickinson

Swift Tech Overview

April 22, 2012

OpenStack Object Storage, called swift, is a distributed, fault-tolerant, eventually consistent object storage system. In this post, I’d like to go into some detail about what that means.

Distributed

Swift is a distributed system. It is designed to be run on a cluster of computers rather than on a single machine. Swift is composed of three major parts: the proxy, storage servers, and consistency servers.

Proxy

The proxy server is a server process that provides the swift API. As the only system in the swift cluster that communicates with clients, the proxy is responsible for coordinating with the storage servers and replying to the client with appropriate messages. The proxy is an HTTP server that implements swift’s REST-ful API. All messages to and from the proxy use standard HTTP verbs and response codes. This allows developers building clients to interact with swift in a simple, familiar way.
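
To make that concrete, here is a minimal sketch of talking to the proxy over HTTP using Python’s requests library. The storage URL, token, and object names are placeholders; in a real cluster you would get the URL and token from the auth system.

    import requests

    # Placeholders: a real storage URL and token come from the auth system.
    STORAGE_URL = "https://swift.example.com/v1/AUTH_myaccount"
    headers = {"X-Auth-Token": "AUTH_tk_example"}

    # Upload an object with a standard HTTP PUT ...
    with open("cat.jpg", "rb") as f:
        resp = requests.put(STORAGE_URL + "/photos/cat.jpg",
                            headers=headers, data=f)
    print(resp.status_code)  # 201 Created on success

    # ... and read it back with a GET.
    resp = requests.get(STORAGE_URL + "/photos/cat.jpg", headers=headers)
    print(resp.status_code, len(resp.content))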

Swift provides data durability by writing multiple complete replicas of the data stored in the system. The proxy coordinates the read and write requests from clients and implements the read and write guarantees of the system. When a client sends a write request, the proxy ensures that the object has been successfully written to disk on a majority of its storage nodes before responding with a code indicating success.
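
Here is a simplified sketch of that coordination logic (not swift’s actual code): fan the write out to every replica and report success only once a majority have succeeded.

    # A simplified sketch of proxy write coordination, not swift's real code.
    def quorum_write(replica_writers, data):
        """Send data to every replica; succeed only on a majority of successes."""
        successes = sum(1 for write in replica_writers if write(data))
        quorum = len(replica_writers) // 2 + 1
        return successes >= quorum

    # With three replicas, two successful disk writes are enough to tell
    # the client the object is durably stored.
    writers = [lambda d: True, lambda d: True, lambda d: False]
    print(quorum_write(writers, b"object bytes"))  # True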

Storage Servers

The swift storage servers provide the on-disk storage for the cluster. There are three types of storage servers in swift: account, container, and object. Each of these servers provides an internal REST-ful API. The account and container servers provide namespace partitioning and listing functionality. They are implemented as SQLite databases on disk, and like all entities in swift, they are replicated to multiple availability zones within the swift cluster.
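
As a toy illustration of the idea (this is not swift’s real schema), a container listing is essentially a query against a small per-container SQLite database:

    import sqlite3

    # A toy per-container database: listing the container is just a query.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE object (name TEXT PRIMARY KEY, size INTEGER)")
    db.executemany("INSERT INTO object VALUES (?, ?)",
                   [("cat.jpg", 48213), ("dog.jpg", 51902)])

    # Listing the container = reading the rows back in sorted order.
    for name, size in db.execute("SELECT name, size FROM object ORDER BY name"):
        print(name, size)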

Swift is designed for multi-tenancy. Users are generally given access to a single swift account within a cluster, and they have complete control over that unique namespace. The account server implements this functionality. Users can set metadata on their account, and swift aggregates usage information here. Additionally, the account server provides a listing of the containers within an account.

Swift users may segment their namespace into individual containers. Although containers cannot be nested, they are conceptually similar to directories or folders in a file system. Like accounts, users may set metadata on individual containers, and containers provide a listing of each object they contain. There is no limit to the number of containers that a user may create within a swift account, and the containers do not have globally-unique naming requirements.
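
For example, a client can create a container and attach metadata to it in one PUT request; the hostname and token below are placeholders:

    import requests

    STORAGE_URL = "https://swift.example.com/v1/AUTH_myaccount"  # placeholder
    headers = {"X-Auth-Token": "AUTH_tk_example"}                # placeholder

    # Create a container; metadata is set with X-Container-Meta-* headers.
    meta_headers = dict(headers)
    meta_headers["X-Container-Meta-Purpose"] = "vacation"
    resp = requests.put(STORAGE_URL + "/photos", headers=meta_headers)
    print(resp.status_code)  # 201 Created, or 202 if it already exists

    # List the objects in the container (newline-separated names).
    resp = requests.get(STORAGE_URL + "/photos", headers=headers)
    print(resp.text)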

Object servers provide the on-disk storage for objects stored within swift. Each object in swift is stored as a single file on disk, and object metadata is stored in the file’s extended attributes. This simple design allows the object’s data and metadata to be stored together and replicated as a single unit.
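
Here is a sketch of the idea (not swift’s exact on-disk code, though the attribute name mirrors the one swift uses): the metadata dictionary is serialized into an extended attribute on the same file that holds the object’s bytes. This requires Linux and a file system with user xattr support.

    import os
    import pickle
    import tempfile

    # Stand-in for an object file on a storage node's disk.
    fd, path = tempfile.mkstemp(suffix=".data")
    os.write(fd, b"object bytes")
    os.close(fd)

    # Serialize the metadata into an extended attribute on the data file.
    metadata = {"Content-Type": "image/jpeg", "X-Object-Meta-Mood": "happy"}
    os.setxattr(path, "user.swift.metadata", pickle.dumps(metadata))

    # Data and metadata now live in one file, so copying the file (with
    # its xattrs) replicates both as a single unit.
    print(pickle.loads(os.getxattr(path, "user.swift.metadata")))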

Consistency Servers

Storing data on disk and providing a REST-ful API to it is not a hard problem to solve. The hard part is handling failures. Swift’s consistency servers are responsible for finding and correcting errors caused by both data corruption and hardware failures.

Auditors run in the background on every swift server and continually scan the disks to ensure that the data stored on disk has not suffered any bit-rot or file system corruption. If an error is found, the corrupted object is moved to a quarantine area, and replication is responsible for replacing the data with a known good copy.
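
A simplified sketch of an auditor’s core loop (not swift’s actual code): hash the bytes on disk and quarantine the file if the checksum no longer matches the one recorded when the object was written.

    import hashlib
    import os
    import shutil

    def audit(path, expected_md5, quarantine_dir):
        """Re-hash an object file; quarantine it if the bytes have rotted."""
        hasher = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(64 * 1024), b""):
                hasher.update(chunk)
        if hasher.hexdigest() != expected_md5:
            # Move the corrupt copy aside; replication restores a good one.
            shutil.move(path,
                        os.path.join(quarantine_dir, os.path.basename(path)))
            return False
        return True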

Updaters ensure that account and container listings are correct. The object updater is responsible for keeping the object listings in the containers correct, and the container updater is responsible for keeping the account listings up-to-date. Additionally, the object updater updates the object count and bytes used in the container metadata, and the container updater updates the object count, container count, and bytes used in the account metadata.

Replicators ensure that the data stored in the cluster is where it should be and that enough copies of the data exist in the system. Generally, the replicators are responsible for repairing any corruption or degraded durability in the cluster.
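
The shape of a replication pass, heavily simplified (real swift compares per-directory hashes and ships files with rsync):

    # A toy replication pass: push any object a peer is missing or holds
    # a different version of. Both arguments map object name -> content hash.
    def replicate(local_objects, peer_objects):
        for name, digest in local_objects.items():
            if peer_objects.get(name) != digest:
                peer_objects[name] = digest  # stand-in for copying the bytes

    local = {"cat.jpg": "9b3f", "dog.jpg": "12aa"}
    peer = {"cat.jpg": "corrupted"}
    replicate(local, peer)
    print(peer)  # both objects now match the local copies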

Fault-tolerant

The combination of swift’s pieces allows a swift cluster to be highly fault-tolerant. Swift implements the concept of availability zones within a single geographic region, and data can be written to hand-off nodes if primary nodes are not available. This allows swift to survive hardware failures up to and including the loss of an entire availability zone with no impact to the end-user.

An interesting consequence of this design is that upgrades and cluster resizes can be easily performed on a production cluster with zero end-user downtime. Swift provides both forward and backward compatibility of its API, so a swift cluster can be running multiple versions of the swift software at the same time, as is common while the software is being upgraded. Similarly, during a resize, an out-of-date view of where data lives is simply treated as a failure, and processes like replication ensure that the data is moved to its correct location.

Eventually Consistent

Swift achieves high scalability by relaxing constraints on consistency. While swift provides read-your-writes consistency for new objects, listings and aggregate metadata (like usage information) may not be immediately accurate. Similarly, reading an object that has been overwritten with new data may return an older version of the object data. However, swift provides the ability for the client to request the most up-to-date version at the cost of request latency.
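
Swift exposes that choice through the X-Newest request header: when it is set, the proxy consults every replica and returns the most recent copy rather than the first to answer. A sketch, again with placeholder URL and token:

    import requests

    STORAGE_URL = "https://swift.example.com/v1/AUTH_myaccount"  # placeholder
    headers = {"X-Auth-Token": "AUTH_tk_example",                # placeholder
               "X-Newest": "true"}  # ask the proxy to check every replica

    # Slower than a normal GET, since the proxy waits on all replicas
    # instead of the first to answer, but it avoids reading a stale copy.
    resp = requests.get(STORAGE_URL + "/photos/cat.jpg", headers=headers)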

Example Request Flow

When an object PUT request is made to swift, the proxy server determines the correct storage nodes responsible for the data (based on a hash of the object name) and sends the object data to those object servers concurrently. If one of the primary storage nodes is unavailable, the proxy will choose an appropriate hand-off node to write data to. If a majority of the object servers respond with success, then the proxy returns success to the client.
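
Here is a toy version of that placement lookup (swift’s real ring is more sophisticated, spreading partitions across zones and devices): hash the object’s full path and use the top bits of the hash to pick a partition, which maps to an ordered list of primary nodes.

    import hashlib

    PART_POWER = 8  # a real ring uses a much larger partition space

    def get_nodes(ring, account, container, obj):
        """Map an object's path to its partition's primary storage nodes."""
        path = "/%s/%s/%s" % (account, container, obj)
        digest = hashlib.md5(path.encode("utf-8")).digest()
        partition = int.from_bytes(digest[:4], "big") >> (32 - PART_POWER)
        return ring[partition]

    # Placeholder ring: every partition maps to the same three nodes here;
    # a real ring spreads partitions across zones and devices.
    ring = {p: ["node-a", "node-b", "node-c"] for p in range(2 ** PART_POWER)}
    print(get_nodes(ring, "AUTH_myaccount", "photos", "cat.jpg"))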

Similarly, when an object GET request is made, the proxy determines which three storage nodes have the data and then requests the data from each node in turn. The proxy will return the object data from the first storage node to respond successfully.

Client Data Designs

Using any storage system effectively means understanding the characteristics of the system and the guarantees that the system provides. Swift is optimized for high concurrency rather than single-stream throughput. The aggregate throughput of a swift cluster is much higher than what is available for a single request stream. A swift client can take advantage of this by distributing data across multiple containers within an account. For example, backups may be stored by day or week in a container that includes that information in its name. Or a photo-sharing application may store images across many containers by using a prefix of the hash of the photo in the container names.
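
For instance, here is a sketch of the photo-sharing approach: derive the container name from a short prefix of the object’s hash so that writes spread over many container databases instead of serializing on one. The "photos-" naming scheme is just an example.

    import hashlib

    def container_for(photo_name, prefix_len=2):
        """Pick a container name from a short prefix of the object's hash."""
        digest = hashlib.md5(photo_name.encode("utf-8")).hexdigest()
        return "photos-" + digest[:prefix_len]

    print(container_for("cat.jpg"))  # "photos-" plus two hex digits
    print(container_for("dog.jpg"))  # very likely a different container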

Summary

Swift’s design provides robust software that can run effectively on unreliable (read: cheap) hardware. Modular processes allow deployers to optimize clusters based on client use cases. Fault-tolerance allows clusters to be effectively managed by a limited operations staff.

Swift is production-ready code that has been running at scale powering Rackspace Cloud Files for two years. It is being deployed around the world at large and small scale by public cloud service providers and for private, internal needs. Swift is 100% open source, released under the Apache 2.0 license. For more information, you can read the technical docs, the admin guide, or the API guide. To get started building applications for swift, you can use either the stand-alone Python module included in swift’s code or any of Rackspace’s Cloud Files language bindings. If you have further questions, ask on the OpenStack mailing list or in #openstack on freenode.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

The thoughts expressed here are my own and do not necessarily represent those of my employer.