Large Object Support

Overview

Swift limits the size of a single uploaded object; by default the limit is 5GB. However, the download size of a single object is virtually unlimited thanks to segmentation. Segments of the larger object are uploaded and a special manifest file is created that, when downloaded, sends all the segments concatenated as a single object. Segmentation also allows much faster uploads, because the segments can be uploaded in parallel.

Dynamic Large Objects

Middleware that will provide Dynamic Large Object (DLO) support.

Using swift

The quickest way to try out this feature is to use the swift tool included with the python-swiftclient library. You can use the -S option to specify the segment size to use when splitting a large file. For example:

swift upload test_container -S 1073741824 large_file

This would split large_file into 1 GiB segments and begin uploading those segments in parallel. Once all the segments have been uploaded, swift creates the manifest file so the segments can be downloaded as one.

So now, the following swift command would download the entire large object:

swift download test_container large_file

The swift command uses a strict convention for its segmented object support. In the above example it will upload all the segments into a second container named test_container_segments. These segments will have names like large_file/1290206778.25/21474836480/00000000, large_file/1290206778.25/21474836480/00000001, etc.

The main benefit for using a separate container is that the main container listings will not be polluted with all the segment names. The reason for using the segment name format of <name>/<timestamp>/<size>/<segment> is so that an upload of a new file with the same name won’t overwrite the contents of the first until the last moment when the manifest file is updated.
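As a rough illustration of that naming convention, the sketch below generates names of the form <name>/<timestamp>/<size>/<segment>. The function name is hypothetical, and reading <size> as the total file size is an assumption made for this example; python-swiftclient's actual implementation may differ.

```python
def segment_names(name, timestamp, total_size, segment_size):
    """Return segment object names as <name>/<timestamp>/<size>/<segment>."""
    # Integer ceiling division: number of segments needed to cover the file.
    count = -(-total_size // segment_size)
    # Zero-pad the index to 8 digits so the names sort in upload order.
    return ["%s/%s/%d/%08d" % (name, timestamp, total_size, index)
            for index in range(count)]


names = segment_names("large_file", "1290206778.25", 21474836480, 1073741824)
print(names[0])    # large_file/1290206778.25/21474836480/00000000
print(len(names))  # 20
```

Because the timestamp is part of the prefix, a second upload of the same file produces a disjoint set of segment names, which is what lets the old segments stay readable until the manifest is switched over.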

swift will manage these segment files for you, deleting old segments on deletes and overwrites, etc. You can override this behavior with the --leave-segments option if desired; this is useful if you want to have multiple versions of the same large object available.

Direct API

You can also work with the segments and manifests directly with HTTP requests instead of having swift do that for you. You can just upload the segments like you would any other object and the manifest is just a zero-byte (not enforced) file with an extra X-Object-Manifest header.

All the object segments need to be in the same container, have a common object name prefix, and sort in the order in which they should be concatenated. Object names are sorted lexicographically as UTF-8 byte strings. They don’t have to be in the same container as the manifest file will be, which is useful to keep container listings clean as explained above with swift.
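The sorting rule is the reason segment names are usually zero-padded. The following sketch imitates the ordering with plain Python; the helper name and the sample object names are made up for the example, and a real listing would come from the container server.

```python
def concatenation_order(object_names, prefix):
    """Return the names under `prefix`, ordered as UTF-8 byte strings."""
    matching = [name for name in object_names if name.startswith(prefix)]
    return sorted(matching, key=lambda name: name.encode("utf-8"))


unpadded = ["myobject/1", "myobject/2", "myobject/10"]
padded = ["myobject/00000001", "myobject/00000002", "myobject/00000010"]

print(concatenation_order(unpadded, "myobject/"))
# ['myobject/1', 'myobject/10', 'myobject/2']  (segment 10 lands before 2)
print(concatenation_order(padded, "myobject/"))
# ['myobject/00000001', 'myobject/00000002', 'myobject/00000010']
```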

The manifest file is simply a zero-byte (not enforced) file with the extra X-Object-Manifest: <container>/<prefix> header, where <container> is the container the object segments are in and <prefix> is the common prefix for all the segments.

It is best to upload all the segments first and then create or update the manifest. In this way, the full object won’t be available for downloading until the upload is complete. Also, you can upload a new set of segments to a second location and then update the manifest to point to this new location. During the upload of the new segments, the original manifest will still be available to download the first set of segments.

Note

When updating a manifest object using a POST request, an X-Object-Manifest header must be included for the object to continue to behave as a manifest object.

The manifest file should have no content. However, this is not enforced. If the manifest path itself matches the container/prefix specified in X-Object-Manifest and the manifest contains data, the manifest is also treated as a segment and its content becomes part of the concatenated GET response. Concatenation follows the usual DLO order: segments are concatenated in the order returned when their names are sorted.

Here’s an example using curl with tiny 1-byte segments:

# First, upload the segments
curl -X PUT -H 'X-Auth-Token: <token>'         http://<storage_url>/container/myobject/00000001 --data-binary '1'
curl -X PUT -H 'X-Auth-Token: <token>'         http://<storage_url>/container/myobject/00000002 --data-binary '2'
curl -X PUT -H 'X-Auth-Token: <token>'         http://<storage_url>/container/myobject/00000003 --data-binary '3'

# Next, create the manifest file
curl -X PUT -H 'X-Auth-Token: <token>'         -H 'X-Object-Manifest: container/myobject/'         http://<storage_url>/container/myobject --data-binary ''

# And now we can download the segments as a single object
curl -H 'X-Auth-Token: <token>'         http://<storage_url>/container/myobject
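For readers scripting this in Python, the same sequence can be sketched with the standard library. The function name, the storage URL http://swift.example.com/v1/AUTH_test, and the token value are placeholders invented for this example; the code only builds the requests and does not send them.

```python
import urllib.request


def dlo_requests(storage_url, token, container, name, segments):
    """Build (but do not send) PUT requests for segments and a DLO manifest."""
    base = "%s/%s/%s" % (storage_url.rstrip("/"), container, name)
    requests = []
    for index, body in enumerate(segments, start=1):
        requests.append(urllib.request.Request(
            "%s/%08d" % (base, index), data=body, method="PUT",
            headers={"X-Auth-Token": token}))
    # The manifest goes last, so the full object only appears once complete.
    requests.append(urllib.request.Request(
        base, data=b"", method="PUT",
        headers={"X-Auth-Token": token,
                 "X-Object-Manifest": "%s/%s/" % (container, name)}))
    return requests


reqs = dlo_requests("http://swift.example.com/v1/AUTH_test", "<token>",
                    "container", "myobject", [b"1", b"2", b"3"])
for req in reqs:
    print(req.get_method(), req.full_url)
```

Each request could then be passed to urllib.request.urlopen() against a real cluster. Note the trailing slash in the manifest header value: it keeps the manifest object itself (container/myobject) out of the segment prefix.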

class swift.common.middleware.dlo.GetContext(dlo, logger)

Bases: swift.common.wsgi.WSGIContext

get_or_head_response(req, x_object_manifest)
Parameters
  • req – user’s request

  • x_object_manifest – as unquoted, native string

handle_request(req, start_response)

Take a GET or HEAD request, and if it is for a dynamic large object manifest, return an appropriate response.

Otherwise, simply pass it through.

Static Large Objects

Middleware that will provide Static Large Object (SLO) support.

This feature is very similar to Dynamic Large Object (DLO) support in that it allows the user to upload many objects concurrently and afterwards download them as a single object. It is different in that it does not rely on eventually consistent container listings to do so. Instead, a user defined manifest of the object segments is used.

Uploading the Manifest

After the user has uploaded the objects to be concatenated, a manifest is uploaded. The request must be a PUT with the query parameter:

?multipart-manifest=put

The body of this request will be an ordered list of segment descriptions in JSON format. The data to be supplied for each segment is either:

Key         Description
path        the path to the segment object (not including account) /container/object_name
etag        (optional) the ETag given back when the segment object was PUT
size_bytes  (optional) the size of the complete segment object in bytes
range       (optional) the (inclusive) range within the object to use as a segment. If omitted, the entire object is used

Or:

Key   Description
data  base64-encoded data to be returned

Note

At least one object-backed segment must be included. If you’d like to create a manifest consisting purely of data segments, consider uploading a normal object instead.

The format of the list will be:

[{"path": "/cont/object",
  "etag": "etagoftheobjectsegment",
  "size_bytes": 10485760,
  "range": "1048576-2097151"},
 {"data": base64.b64encode("interstitial data")},
 {"path": "/cont/another-object", ...},
 ...]
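The base64.b64encode(...) call in the listing above is shorthand rather than literal JSON. A minimal sketch of producing a real request body in Python follows; the helper name build_manifest is hypothetical, and the paths and etag are the illustrative values from the listing.

```python
import base64
import json


def build_manifest(entries):
    """Serialize segment descriptions into an SLO manifest body.

    Each entry is either a dict with "path" (plus optional "etag",
    "size_bytes", "range") or raw bytes to inline as a data segment.
    """
    manifest = []
    for entry in entries:
        if isinstance(entry, bytes):
            # b64encode returns bytes; JSON needs a str.
            encoded = base64.b64encode(entry).decode("ascii")
            manifest.append({"data": encoded})
        else:
            manifest.append(dict(entry))
    return json.dumps(manifest)


body = build_manifest([
    {"path": "/cont/object", "etag": "etagoftheobjectsegment",
     "size_bytes": 10485760, "range": "1048576-2097151"},
    b"interstitial data",
    {"path": "/cont/another-object"},
])
print(body)
```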

The number of object-backed segments is limited to max_manifest_segments (configurable in proxy-server.conf, default 1000). Each segment must be at least 1 byte. On upload, the middleware will send a HEAD request for every object-backed segment passed in to verify:

  1. the segment exists (i.e. the HEAD was successful);

  2. the segment meets minimum size requirements;

  3. if the user provided a non-null etag, the etag matches;

  4. if the user provided a non-null size_bytes, the size_bytes matches; and

  5. if the user provided a range, it is a singular, syntactically correct range that is satisfiable given the size of the object referenced.

For inlined data segments, the middleware verifies each is valid, non-empty base64-encoded binary data. Note that data segments do not count against max_manifest_segments.

Note that the etag and size_bytes keys are optional; if omitted, the verification is not performed. If any of the objects fail to verify (not found, size/etag mismatch, below minimum size, invalid range) then the user will receive a 4xx error response. If everything does match, the user will receive a 2xx response and the SLO object is ready for downloading.
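Some of these rules can be checked on the client before the manifest is sent, which avoids a round trip for obvious mistakes. The sketch below is a client-side approximation written for this document, not the middleware's own validation code; the function name is hypothetical, and the segment limit is the documented default, which a deployment may override. Checks that need the stored object (existence, etag match, satisfiable range) still require the server.

```python
import base64
import binascii

MAX_MANIFEST_SEGMENTS = 1000  # documented default; deployments may differ


def precheck_manifest(manifest):
    """Return a list of problems found in a parsed manifest (list of dicts)."""
    problems = []
    object_segments = [seg for seg in manifest if "path" in seg]
    if not object_segments:
        problems.append("at least one object-backed segment is required")
    if len(object_segments) > MAX_MANIFEST_SEGMENTS:
        problems.append("too many object-backed segments")
    for index, seg in enumerate(manifest):
        if "data" in seg:
            try:
                raw = base64.b64decode(seg["data"], validate=True)
            except (binascii.Error, ValueError, TypeError):
                problems.append("segment %d: invalid base64" % index)
                continue
            if not raw:
                problems.append("segment %d: empty data" % index)
        elif "path" not in seg:
            problems.append("segment %d: needs a path or data key" % index)
        elif seg.get("size_bytes") is not None and int(seg["size_bytes"]) < 1:
            problems.append("segment %d: below minimum size" % index)
    return problems


print(precheck_manifest([{"path": "/cont/object", "size_bytes": 10485760},
                         {"data": "aW50ZXJzdGl0aWFsIGRhdGE="}]))
```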

Note that large manifests may take a long time to verify; historically, clients would need to use a long read timeout for the connection to give Swift enough time to send a final 201 Created or 400 Bad Request response. Now, clients should use the query parameters:

?multipart-manifest=put&heartbeat=on

to request that Swift send an immediate 202 Accepted response and periodic whitespace to keep the connection alive. A final response code will appear in the body. The format of the response body defaults to text/plain but can be either json or xml depending on the Accept header. An example body is as follows:

Response Status: 201 Created
Response Body:
Etag: "8f481cede6d2ddc07cb36aa084d9a64d"
Last Modified: Wed, 25 Oct 2017 17:08:55 GMT
Errors:

Or, as a json response:

{"Response Status": "201 Created",
 "Response Body": "",
 "Etag": "\"8f481cede6d2ddc07cb36aa084d9a64d\"",
 "Last Modified": "Wed, 25 Oct 2017 17:08:55 GMT",
 "Errors": []}

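Because a heartbeat request always starts with 202 Accepted, a client has to read the outcome from the body instead of the HTTP status line. A minimal sketch of that step, assuming the client asked for a JSON body via the Accept header; the function name is hypothetical and the sample body reuses the values shown above with some keep-alive whitespace in front.

```python
import json


def final_status(body):
    """Extract the real status code and errors from a heartbeat JSON body."""
    # json.loads tolerates the leading whitespace sent as keep-alive padding.
    result = json.loads(body)
    code = int(result["Response Status"].split()[0])
    return code, result["Errors"]


body = ('   \n  {"Response Status": "201 Created", "Response Body": "", '
        '"Etag": "\\"8f481cede6d2ddc07cb36aa084d9a64d\\"", '
        '"Last Modified": "Wed, 25 Oct 2017 17:08:55 GMT", "Errors": []}')
code, errors = final_status(body)
print(code, errors)  # 201 []
```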
Behind the scenes, on success, a JSON manifest generated from the user input is sent to object servers with an extra X-Static-Large-Object: True header and a modified Content-Type. The items in this manifest will include the etag and size_bytes for each segment, regardless of whether the client specified them for verification. The parameter swift_bytes=$total_size will be appended to the existing Content-Type, where $total_size is the sum of all the included segments’ size_bytes. This extra parameter will be hidden from the user.

Manifest files can reference objects in separate containers, which will improve concurrent upload speed. Objects can be referenced by multiple manifests. The segments of a SLO manifest can even be other SLO manifests. Treat them like any other object; that is, in the parent SLO's manifest, use the Etag and Content-Length returned by the PUT of the sub-SLO.

When uploading a manifest, a user can send an Etag for verification. If no ranges are specified, it must be the md5 of the concatenated segment etags. For example, if the manifest to be uploaded looks like this:

[{"path": "/cont/object1",
  "etag": "etagoftheobjectsegment1",
  "size_bytes": 10485760},
 {"path": "/cont/object2",
  "etag": "etagoftheobjectsegment2",
  "size_bytes": 10485760}]

The Etag of the above manifest would be md5 of etagoftheobjectsegment1 and etagoftheobjectsegment2. This could be computed in the following way:

echo -n 'etagoftheobjectsegment1etagoftheobjectsegment2' | md5sum

If a manifest to be uploaded with a segment range looks like this:

[{"path": "/cont/object1",
  "etag": "etagoftheobjectsegmentone",
  "size_bytes": 10485760,
  "range": "1-2"},
 {"path": "/cont/object2",
  "etag": "etagoftheobjectsegmenttwo",
  "size_bytes": 10485760,
  "range": "3-4"}]

While computing the Etag of the above manifest, internally each segment’s etag will be taken in the form of etagvalue:rangevalue;. Hence the Etag of the above manifest would be:

echo -n 'etagoftheobjectsegmentone:1-2;etagoftheobjectsegmenttwo:3-4;' \
| md5sum

For the purposes of Etag computations, inlined data segments are considered to have an etag of the md5 of the raw data (i.e., not base64-encoded).
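The three rules (plain segments, ranged segments, inlined data) can be combined into one helper. This is a sketch derived from the description in this section, with a hypothetical function name, not a copy of the middleware's code; in particular, it assumes a data segment contributes only its md5 hex digest with no range suffix.

```python
import base64
import hashlib


def slo_etag(manifest):
    """Compute the expected SLO Etag from a list of segment dicts."""
    parts = []
    for seg in manifest:
        if "data" in seg:
            # Inlined data uses the md5 of the raw (decoded) bytes.
            raw = base64.b64decode(seg["data"])
            parts.append(hashlib.md5(raw).hexdigest())
        elif seg.get("range"):
            parts.append("%s:%s;" % (seg["etag"], seg["range"]))
        else:
            parts.append(seg["etag"])
    return hashlib.md5("".join(parts).encode("ascii")).hexdigest()


ranged = [
    {"path": "/cont/object1", "etag": "etagoftheobjectsegmentone",
     "size_bytes": 10485760, "range": "1-2"},
    {"path": "/cont/object2", "etag": "etagoftheobjectsegmenttwo",
     "size_bytes": 10485760, "range": "3-4"},
]
print(slo_etag(ranged))
```

The printed value should match the output of the md5sum pipeline shown above for the ranged manifest.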

Range Specification

Users now have the ability to specify ranges for SLO segments. Users can include an optional range field in segment descriptions to specify which bytes from the underlying object should be used for the segment data. Only one range may be specified per segment.

Note

The etag and size_bytes fields still describe the backing object as a whole.

If a user uploads this manifest:

[{"path": "/con/obj_seg_1", "size_bytes": 2097152, "range": "0-1048576"},
 {"path": "/con/obj_seg_2", "size_bytes": 2097152,
  "range": "512-1550000"},
 {"path": "/con/obj_seg_1", "size_bytes": 2097152, "range": "-2048"}]

The resulting object will consist of the first 1048577 bytes of /con/obj_seg_1 (offsets 0 through 1048576, inclusive), followed by offsets 512 through 1550000 (inclusive) of /con/obj_seg_2, and finally offsets 2095104 through 2097151 (i.e., the last 2048 bytes) of /con/obj_seg_1. Offsets are zero-based.
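Range strings use zero-based, inclusive byte offsets, so the exact bytes selected are easy to get wrong by one. The sketch below resolves each of the three forms against an object size; the function is a hypothetical helper for illustration and is not the swob.Range class used by the middleware.

```python
def resolve_range(spec, size):
    """Resolve "M-N", "M-" or "-N" to inclusive (first, last) byte offsets."""
    start, _, end = spec.partition("-")
    if start == "":
        # Suffix form: the last N bytes of the object.
        first, last = max(size - int(end), 0), size - 1
    else:
        first = int(start)
        last = min(int(end), size - 1) if end else size - 1
    if first > last:
        raise ValueError("unsatisfiable range %r for size %d" % (spec, size))
    return first, last


for spec in ("0-1048576", "512-1550000", "-2048"):
    first, last = resolve_range(spec, 2097152)
    print(spec, first, last, last - first + 1)
# 0-1048576 0 1048576 1048577
# 512-1550000 512 1550000 1549489
# -2048 2095104 2097151 2048
```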

Note

The minimum sized range is 1 byte. This is the same as the minimum segment size.

Inline Data Specification

When uploading a manifest, users can include ‘data’ segments that should be included along with objects. The data in these segments must be base64-encoded binary data and will be included in the etag of the resulting large object exactly as if that data had been uploaded and referenced as separate objects.

Note

This feature is primarily aimed at reducing the need for storing many tiny objects, and as such any supplied data must fit within the maximum manifest size (default is 8MiB). This maximum size can be configured via max_manifest_size in proxy-server.conf.

Retrieving a Large Object

A GET request to the manifest object will return the concatenation of the objects from the manifest much like DLO. If any of the segments from the manifest are not found or their Etag/Content-Length have changed since upload, the connection will drop. In this case a 409 Conflict will be logged in the proxy logs and the user will receive incomplete results. Note that this will be enforced regardless of whether the user performed per-segment validation during upload.

The headers from this GET or HEAD request will return the metadata attached to the manifest object itself with some exceptions:

Header                 Value
Content-Length         the total size of the SLO (the sum of the sizes of the segments in the manifest)
X-Static-Large-Object  the string “True”
Etag                   the etag of the SLO (generated the same way as DLO)

A GET request with the query parameter:

?multipart-manifest=get

will return a transformed version of the original manifest, containing additional fields and different key names. For example, the first manifest in the example above would look like this:

[{"name": "/cont/object",
  "hash": "etagoftheobjectsegment",
  "bytes": 10485760,
  "range": "1048576-2097151"}, ...]

As you can see, some of the fields are renamed compared to the put request: path is name, etag is hash, size_bytes is bytes. The range field remains the same (if present).
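The renaming can be expressed as a small mapping. This sketch covers only the three renamed keys described here; any additional fields the server adds to the returned manifest are not modeled, and the names PUT_TO_GET_KEYS and put_entry_to_get_entry are invented for the example.

```python
PUT_TO_GET_KEYS = {"path": "name", "etag": "hash", "size_bytes": "bytes"}


def put_entry_to_get_entry(entry):
    """Rename the keys of a PUT-style segment entry to the GET-style names."""
    # Keys without a mapping (such as "range") are carried over unchanged.
    return {PUT_TO_GET_KEYS.get(key, key): value
            for key, value in entry.items()}


put_entry = {"path": "/cont/object", "etag": "etagoftheobjectsegment",
             "size_bytes": 10485760, "range": "1048576-2097151"}
print(put_entry_to_get_entry(put_entry))
```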

A GET request with the query parameters:

?multipart-manifest=get&format=raw

will return the contents of the original manifest as it was sent by the client. Both calls are intended primarily for debugging.

When the manifest object is uploaded you are more or less guaranteed that every segment in the manifest exists and matches the specifications. However, there is nothing that prevents the user from breaking the SLO download by deleting/replacing a segment referenced in the manifest. It is left to the user to use caution in handling the segments.

Deleting a Large Object

A DELETE request will just delete the manifest object itself. The segment data referenced by the manifest will remain unchanged.

A DELETE with a query parameter:

?multipart-manifest=delete

will delete all the segments referenced in the manifest and then the manifest itself. The failure response will be similar to that of the bulk delete middleware.

A DELETE with the query parameters:

?multipart-manifest=delete&async=yes

will schedule all the segments referenced in the manifest to be deleted asynchronously and then delete the manifest itself. Note that segments will continue to appear in listings and be counted for quotas until they are cleaned up by the object-expirer. This option is only available when all segments are in the same container and none of them are nested SLOs.

Modifying a Large Object

PUT and POST requests will work as expected; for example, a PUT will simply overwrite the manifest object.

Container Listings

In a container listing the size listed for SLO manifest objects will be the total_size of the concatenated segments in the manifest. The overall X-Container-Bytes-Used for the container (and subsequently for the account) will not reflect total_size of the manifest but the actual size of the JSON data stored. The reason for this somewhat confusing discrepancy is we want the container listing to reflect the size of the manifest object when it is downloaded. We do not, however, want to count the bytes-used twice (for both the manifest and the segments it’s referring to) in the container and account metadata which can be used for stats and billing purposes.
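A small worked example may make the discrepancy concrete. The sketch assumes the stored manifest uses the name/hash/bytes keys shown for ?multipart-manifest=get and that no segment has a range; the real on-disk manifest may carry more fields, so the byte count here is illustrative only.

```python
import json


def manifest_accounting(stored_manifest):
    """Return (size shown in the listing, bytes counted toward bytes-used)."""
    entries = json.loads(stored_manifest)
    # Listing size: sum of the segment sizes (assumes no range fields).
    total_size = sum(entry["bytes"] for entry in entries)
    # Bytes-used: only the JSON document itself is stored in this object.
    stored_size = len(stored_manifest.encode("utf-8"))
    return total_size, stored_size


stored = json.dumps([
    {"name": "/cont/object1", "hash": "etagoftheobjectsegment1",
     "bytes": 10485760},
    {"name": "/cont/object2", "hash": "etagoftheobjectsegment2",
     "bytes": 10485760},
])
listing_size, bytes_used = manifest_accounting(stored)
print(listing_size, bytes_used)
```

Here the listing reports 20971520 bytes for the manifest object, while only the couple of hundred bytes of JSON count toward X-Container-Bytes-Used; the 20 MiB of segment data is counted once, in the container that holds the segments.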

class swift.common.middleware.slo.SloGetContext(slo)

Bases: swift.common.wsgi.WSGIContext

convert_segment_listing(resp_headers, resp_iter)

Converts the manifest data to match with the format that was put in through ?multipart-manifest=put

Parameters
  • resp_headers – response headers

  • resp_iter – a response iterable

handle_slo_get_or_head(req, start_response)

Takes a request and a start_response callable and does the normal WSGI thing with them. Returns an iterator suitable for sending up the WSGI chain.

Parameters
  • req – Request object; a GET or HEAD request aimed at what may (or may not) be a static large object manifest.

  • start_response – WSGI start_response callable

class swift.common.middleware.slo.StaticLargeObject(app, conf, max_manifest_segments=1000, max_manifest_size=8388608, yield_frequency=10, allow_async_delete=False)

Bases: object

StaticLargeObject Middleware

See above for a full description.

The proxy logs created for any subrequests made will have swift.source set to “SLO”.

Parameters
  • app – The next WSGI filter or app in the paste.deploy chain.

  • conf – The configuration dict for the middleware.

  • max_manifest_segments – The maximum number of segments allowed in newly-created static large objects.

  • max_manifest_size – The maximum size (in bytes) of newly-created static-large-object manifests.

  • yield_frequency – If the client included heartbeat=on in the query parameters when creating a new static large object, the period of time to wait between sending whitespace to keep the connection alive.

get_segments_to_delete_iter(req)

A generator function to be used to delete all the segments and sub-segments referenced in a manifest.

Parameters

req – a Request with an SLO manifest in path

Raises
  • HTTPPreconditionFailed – on invalid UTF8 in request path

  • HTTPBadRequest – on too many buffered sub segments and on invalid SLO manifest path

get_slo_segments(obj_name, req)

Performs a Request and returns the SLO manifest’s segments.

Parameters
  • obj_name – the name of the object being deleted, as /container/object

  • req – the base Request

Raises
  • HTTPServerError – on unable to load obj_name or on unable to load the SLO manifest data.

  • HTTPBadRequest – on not an SLO manifest

  • HTTPNotFound – on SLO manifest not found

Returns

SLO manifest’s segments

handle_multipart_delete(req)

Will delete all the segments in the SLO manifest and then, if successful, will delete the manifest file.

Parameters

req – a Request with an obj in path

Returns

swob.Response whose app_iter is set to Bulk.handle_delete_iter

handle_multipart_get_or_head(req, start_response)

Handles the GET or HEAD of a SLO manifest.

The response body (only on GET, of course) will consist of the concatenation of the segments.

Parameters
  • req – a Request with a path referencing an object

  • start_response – WSGI start_response callable

Raises

HttpException – on errors

handle_multipart_put(req, start_response)

Will handle the PUT of a SLO manifest. Sends a HEAD request for every object in the manifest to check that it is valid and, if so, saves a manifest generated from the user input. Uses WSGIContext to call self and start_response and returns a WSGI iterator.

Parameters
  • req – a Request with an obj in path

  • start_response – WSGI start_response callable

Raises

HttpException – on errors

swift.common.middleware.slo.parse_and_validate_input(req_body, req_path)

Given a request body, parses it and returns a list of dictionaries.

The output structure is nearly the same as the input structure, but it is not an exact copy. Given a valid object-backed input dictionary d_in, its corresponding output dictionary d_out will be as follows:

  • d_out['etag'] == d_in['etag']

  • d_out['path'] == d_in['path']

  • d_in['size_bytes'] can be a string ("12") or an integer (12), but d_out['size_bytes'] is an integer.

  • (optional) d_in['range'] is a string of the form "M-N", "M-", or "-N", where M and N are non-negative integers. d_out['range'] is the corresponding swob.Range object. If d_in does not have a key 'range', neither will d_out.

Inlined data dictionaries will have any extraneous padding stripped.

Raises

HTTPException on parse errors or semantic errors (e.g. bogus JSON structure, syntactically invalid ranges)

Returns

a list of dictionaries on success

Direct API

SLO support centers around the user-generated manifest file. After the user has uploaded the segments into their account, a manifest file needs to be built and uploaded. All object segments must be at least 1 byte in size. Please see the Static Large Objects documentation for further details.

Additional Notes

  • With a GET or HEAD of a manifest file, the X-Object-Manifest: <container>/<prefix> header will be returned with the concatenated object so you can tell where it’s getting its segments from.

  • When updating a manifest object using a POST request, an X-Object-Manifest header must be included for the object to continue to behave as a manifest object.

  • The response’s Content-Length for a GET or HEAD on the manifest file will be the sum of all the segments in the <container>/<prefix> listing, dynamically. So, uploading additional segments after the manifest is created will cause the concatenated object to be that much larger; there’s no need to recreate the manifest file.

  • The response’s Content-Type for a GET or HEAD on the manifest will be the same as the Content-Type set during the PUT request that created the manifest. You can easily change the Content-Type by reissuing the PUT.

  • The response’s ETag for a GET or HEAD on the manifest file will be the MD5 sum of the concatenated string of ETags for each of the segments in the manifest (for DLO, from the listing <container>/<prefix>). Usually in Swift the ETag is the MD5 sum of the contents of the object, and that holds true for each segment independently. But it’s not meaningful to generate such an ETag for the manifest itself so this method was chosen to at least offer change detection.
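The Content-Length and ETag rules for a dynamic large object can be sketched as follows. The listing entries imitate the name, hash, and bytes fields of a container listing for the three 1-byte segments from the curl example; the function name is hypothetical, and whether the server wraps the ETag value in quotes is not modeled.

```python
import hashlib


def dlo_length_and_etag(listing):
    """Derive a DLO's Content-Length and ETag from its segment listing.

    `listing` holds dicts with "name", "hash" and "bytes", already
    filtered to the <container>/<prefix> named in the manifest.
    """
    ordered = sorted(listing, key=lambda entry: entry["name"].encode("utf-8"))
    joined = "".join(entry["hash"] for entry in ordered)
    etag = hashlib.md5(joined.encode("ascii")).hexdigest()
    return sum(entry["bytes"] for entry in ordered), etag


listing = [
    {"name": "myobject/%08d" % index, "bytes": 1,
     "hash": hashlib.md5(body).hexdigest()}
    for index, body in enumerate([b"1", b"2", b"3"], start=1)
]
length, etag = dlo_length_and_etag(listing)
print(length, etag)
```

Adding a fourth entry to the listing changes both values without touching the manifest, which is the dynamic behavior described in the notes above.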

Note

If you are using the container sync feature you will need to ensure both your manifest file and your segment files are synced if they happen to be in different containers.

History

Dynamic large object support has gone through various iterations before settling on this implementation.

The primary factor driving the limitation of object size in Swift is maintaining balance among the partitions of the ring. To maintain an even dispersion of disk usage throughout the cluster the obvious storage pattern was to simply split larger objects into smaller segments, which could then be glued together during a read.

Before the introduction of large object support some applications were already splitting their uploads into segments and re-assembling them on the client side after retrieving the individual pieces. This design allowed the client to support backup and archiving of large data sets, but was also frequently employed to improve performance or reduce errors due to network interruption. The major disadvantage of this method is that knowledge of the original partitioning scheme is required to properly reassemble the object, which is not practical for some use cases, such as CDN origination.

In order to eliminate any barrier to entry for clients wanting to store objects larger than 5GB, initially we also prototyped fully transparent support for large object uploads. A fully transparent implementation would support a larger max size by automatically splitting objects into segments during upload within the proxy without any changes to the client API. All segments were completely hidden from the client API.

This solution introduced a number of challenging failure conditions into the cluster, wouldn’t provide the client with any option to do parallel uploads, and had no basis for a resume feature. The transparent implementation was deemed just too complex for the benefit.

The current “user manifest” design was chosen in order to provide a transparent download of large objects to the client and still provide the uploading client a clean API to support segmented uploads.

To meet as many use cases as possible, Swift supports two types of large object manifests. Dynamic and static large object manifests both support the same idea of allowing the user to upload many segments to be later downloaded as a single file.

Dynamic large objects rely on a container listing to provide the manifest. This has the advantage of allowing the user to add/remove segments from the manifest at any time. It has the disadvantage of relying on eventually consistent container listings. All three copies of the container dbs must be updated for a complete list to be guaranteed. Also, all segments must be in a single container, which can limit concurrent upload speed.

Static large objects rely on a user-provided manifest file. A user can upload objects into multiple containers and then reference those objects (segments) in a self-generated manifest file. Future GETs to that file will download the concatenation of the specified segments. This has the advantage of being able to immediately download the complete object once the manifest has been successfully PUT. Being able to upload segments into separate containers also improves concurrent upload speed. It has the disadvantage that the manifest is finalized once PUT; any change to it means the manifest has to be replaced.

Between these two methods the user has great flexibility in how they choose to upload and retrieve large objects with Swift. Swift does not, however, stop users from harming themselves. In both cases the segments are deletable by the user at any time. If a segment were deleted by mistake, a dynamic large object, having no way of knowing it was ever there, would happily ignore the deleted file and the user would get an incomplete file. A static large object would, on failing to retrieve an object specified in the manifest, drop the connection and the user would receive partial results.