Programmer Thoughts

By John Dickinson

The Story of an OpenStack Feature

December 10, 2010

OpenStack is a fairly large open-source project with a set of core developers. Anyone can submit patches for bugfixes or new features, but sometimes the process can be a little mysterious, especially for larger features or for developers who haven’t contributed to open-source projects before.

For the swift project (OpenStack storage), we have a mature codebase running in a production environment. Any patch that is accepted must not have adverse effects on the scalability or performance of the system as a whole.

One of the features currently being developed for swift is large object support. The feature has gone through many iterations in both design and code, but perhaps the most important development came at the Bexar design summit. As the developers on the project, we knew that files larger than 5GB were important, but we did not have a good use case. We did not want to develop a solution for large files that did not meet the needs of the users who would actually want the feature. At the design summit, we were able to talk with OpenStack users at NASA who had specific uses in mind for large objects. A Launchpad blueprint was developed, and the existing coding work was refocused to meet the needs of the users.

Update: The large object feature has now been approved and lives in the swift trunk.

There are several ways to implement large object support. The first, and simplest, is to raise the object size limit constant. This constant determines how big a single file can be, but raising it has some nasty side effects. Since an object is stored on one physical drive (per replica), the constant can only be raised to the size of the smallest drive in the cluster. If a cluster is filled with 2TB drives, this means that the largest object can only be 2TB¹. Additionally, since objects are spread evenly throughout the cluster, how evenly full the drives stay is related to the ratio of the max object size to the size of the drives in the cluster. At scale, each drive in a swift cluster will differ from the average fullness of all drives in the cluster by an amount proportional to the max object size.
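To make the fullness claim concrete, here is a toy simulation (plain Python, not swift’s actual placement code) that drops randomly sized objects onto random drives and measures how far the most extreme drive strays from the mean. With a fixed placement sequence, scaling the max object size scales the worst-case imbalance by the same factor.

```python
import random

def worst_imbalance(num_drives, num_objects, max_object_size, seed=0):
    """Place objects of random size (up to max_object_size) onto random
    drives and return how far the most extreme drive strays from the
    mean fullness. A toy model of even placement, not swift's code."""
    rng = random.Random(seed)
    drives = [0.0] * num_drives
    for _ in range(num_objects):
        size = rng.uniform(0, max_object_size)
        drives[rng.randrange(num_drives)] += size
    mean = sum(drives) / num_drives
    return max(abs(d - mean) for d in drives)

# Same seed, so the placement sequence is identical in both runs:
# a 10x larger max object size yields a 10x larger imbalance.
small = worst_imbalance(100, 100_000, max_object_size=5)
large = worst_imbalance(100, 100_000, max_object_size=50)
print(small, large)
```

The drive counts and object counts here are arbitrary; the point is only the proportionality between max object size and imbalance.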

If simply raising the max object size constant won’t work, another way to support large objects is to split the object into chunks and tell the system to treat the group of chunks as one large object. The naive implementation is to split objects as they are streamed into the system: as an object streams into swift, swift could split it after a certain number of bytes and write the subsequent bytes to a different location in the cluster. This approach has the advantage of not requiring the user to know anything about object chunks or to perform a final “commit” step to finalize the large object write. This implementation was actually written, but it was rejected for its enormous complexity and its many failure-condition edge cases. One of its biggest disadvantages is that it does not allow users to upload parts of a large object in parallel. Asking a user to push terabytes of data as a single serial upload simply isn’t practical.
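For illustration only, the core of the naive server-side split amounts to something like the generator below (a minimal sketch; the rejected patch was far more involved, since it had to handle replication and the failure cases mentioned above):

```python
import io

def split_stream(stream, chunk_size):
    """Yield successive fixed-size chunks from a file-like object:
    the naive server-side idea is to cut the incoming byte stream
    every chunk_size bytes and write each piece somewhere else."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

chunks = list(split_stream(io.BytesIO(b"0123456789"), 4))
print(chunks)  # [b'0123', b'4567', b'89']
```

Splitting itself is trivial; the complexity that sank the implementation lives in everything around this loop.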

After talking to swift users at the Bexar design summit, especially those from NASA, we realized that a large object solution implemented with client-side chunking would be sufficient to meet the needs of our users and would offer some advantages without the disadvantages of the server-side chunking implementation.

A client-side chunking implementation of large objects requires users to upload the chunks of a large object as normal objects that share a unique prefix. For example, one could upload three 5GB files (obj/1, obj/2, and obj/3). Then the user creates a manifest object that records the prefix of the chunks (“obj/” in this example). The user can upload the chunks concurrently, and when the manifest file is downloaded, the system streams the concatenated chunks to the client. This allows great flexibility for the user and still allows the system to support very large objects. With the current proposed implementation, the only limit on object size is the size of the cluster itself. If the operators can deploy servers faster than the user can upload data, the object size is truly unlimited. This is much better than the similar feature in S3 that was announced today. The swift feature will allow a manifest file to be created for existing content. Additionally, a manifest can be created for a large object, and content can be added to that large object later without updating the manifest. Possible applications beyond simply storing large single objects include streaming all of the data in a single container to a client as one large object, appending to objects, maintaining symlinks to files, and a pseudo pause-and-resume for uploads.
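The manifest semantics can be modeled with a tiny in-memory sketch (the class and method names here are invented for illustration; real swift speaks HTTP, and the actual manifest mechanics are defined by the implementation in trunk):

```python
class ToyStore:
    """An in-memory sketch of client-side chunking. Chunks are ordinary
    objects sharing a name prefix; a manifest records that prefix, and
    reading the manifest returns the chunks concatenated in name order.
    Illustrative only: not swift's actual API."""

    def __init__(self):
        self.objects = {}    # object name -> bytes
        self.manifests = {}  # manifest name -> chunk prefix

    def put(self, name, data):
        self.objects[name] = data

    def put_manifest(self, name, prefix):
        self.manifests[name] = prefix

    def get(self, name):
        if name in self.manifests:
            prefix = self.manifests[name]
            chunks = sorted(k for k in self.objects if k.startswith(prefix))
            return b"".join(self.objects[k] for k in chunks)
        return self.objects[name]

store = ToyStore()
store.put("obj/2", b"world")      # chunks can arrive in any order,
store.put("obj/1", b"hello ")     # e.g. uploaded concurrently
store.put_manifest("obj", "obj/")
print(store.get("obj"))           # b'hello world'

# Content can be appended later without touching the manifest:
store.put("obj/3", b"!")
print(store.get("obj"))           # b'hello world!'
```

Sorting chunk names lexicographically stands in for however the real implementation orders segments; the last two lines show why appending works without a manifest update: the manifest stores only the prefix, not a fixed chunk list.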

This is a feature that will be included in the swift codebase very soon, and is something we are very excited about. We think it balances the needs of the system (scalability and performance) with the requirements of the users. This feature would not have been implemented nearly as well without input from the community. The conversations we had with people at the design summit were invaluable to the design of this feature.

As always, patches are welcome in swift. If you have bug fixes or an idea for new features, we welcome contributions. Talk to us, submit your code, and give us your use cases: we want swift to be the best it can be. The swift code is hosted on Launchpad, and the developers can be found on IRC in #openstack.

Community input in the OpenStack project is vital. I’m excited about where the project has been, but even more excited to see where the community takes it in the future.

  1. Actually, it would be less than 2TB. A swift cluster is full when the first hard drive in the cluster is full. Therefore, it is wise to limit the fullness of the drives to about 80% of their capacity.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

The thoughts expressed here are my own and do not necessarily represent those of my employer.