Posts tagged cloud files

Cloud Files CDN Stats

Cloud Files offers public content through Limelight’s CDN. On public containers, one can opt in to save the logs for all content requested from the CDN. These logs record raw usage in the Apache log format and are stored compressed in a container named “.CDN_ACCESS_LOGS”. One can then parse these logs with any commercial analytics tool or use a custom solution. Being a developer, I wrote a small Python script that loads these log files and aggregates the data.

The code can be found in my GitHub repository.

After updating the code with your own Cloud Files credentials (or using your own cf_auth module), usage is similar to the following:

$ ./cf_stats.py obj_name

“obj_name” is one of the keys the stats can be grouped on. Others include “date”, “container_name”, and “user_agent”. The default is “obj_name” and any incorrect parameter will generate a usage message.

Sample output:

Object Name: my_file.pdf
Count: 11
User Agents: "Yandex/1.01.001 (compatible; Win16; I)"
Response: 200 304
Referrers: -
IPs: 1.2.3.4 1.2.3.5 1.2.3.6
Dates: 24/Jan/2010 25/Jan/2010 31/Jan/2010 01/Jan/2010 30/Dec/2009
Container Name: some_container

Any of the given fields can be used as a group. Even if the script’s output as-is is not to your liking, its parsing and grouping functions may be a good starting point for writing your own log parser.
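
As a rough illustration of the parse-and-group idea (this is not the code from the repository), here is a minimal Python 3 sketch. It assumes the log lines follow the Apache combined format and that the request path looks like /container_name/obj_name. Both are assumptions, and the names LOG_RE, parse_line and group_by are made up for this example.

```python
import re
from collections import defaultdict

# Apache combined log format (assumed here):
# ip ident user [date:time zone] "METHOD path PROTO" status size "referrer" "user agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^:\]]+)[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<response>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Assumption: the path looks like /container_name/obj_name.
    container, _, obj = entry.pop('path').lstrip('/').partition('/')
    entry['container_name'] = container
    entry['obj_name'] = obj
    return entry


def group_by(lines, key='obj_name'):
    """Group entries on `key`, collecting the distinct values of every other field."""
    groups = defaultdict(lambda: {'count': 0, 'fields': defaultdict(set)})
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that are not in the expected format
        group = groups[entry[key]]
        group['count'] += 1
        for field, value in entry.items():
            if field != key:
                group['fields'][field].add(value)
    return groups
```

Calling group_by(lines, key='date') would group the same entries by day instead. Lines that do not match the assumed format are silently skipped, which may or may not be what you want for real logs.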

Quickly uploading data to Cloud Files

Cloud Files is a great way to store information, either to take advantage of the CDN or to offload the infrastructure requirements of storing large amounts of data. However Cloud Files is used, one must still upload the data to the service before being able to use it.

Uploading the data is not problematic if it can be done in small chunks or spread out over time (images on a blog, for example). The Cloud Files language APIs offer a good way to upload data in these cases. Unfortunately, the language bindings can be terribly slow for uploading large numbers of files. While they do make some optimizations (like reusing connections when available), the code is written to be very generic. For example, the bindings make HEAD requests to ensure all proper data is set before allowing you to upload an object. While this is good in a general sense, these HEAD requests become superfluous when doing a large batch upload. One can achieve much better results by using the Cloud Files ReST API directly.

As an example, let’s look at the following code which uses the Python API:

#!/usr/bin/env python
 
import os
import cloudfiles
 
username = 'xxxx'
apikey = 'xxxx'
 
conn = cloudfiles.get_connection(username, apikey)
 
container = conn.create_container('api_speed_test3')
data_list = ('test_data/%s'%x for x in os.listdir('test_data') if x.endswith('.dat'))
for filename in data_list:
    try:
        obj = container.create_object(filename)
        obj.load_from_filename(filename)
    except cloudfiles.errors.ResponseError, err:
        print err
print len(container.list_objects())

In my tests, using the above code takes about 5.5 minutes to upload 1000 16KB files to Cloud Files.

I wrote the same functionality using the ReST API directly:

#!/usr/bin/python
 
import os
import httplib
 
username = 'xxxx'
apikey = 'xxxx'
 
# auth
conn = httplib.HTTPSConnection('api.mosso.com')
conn.request('GET', '/auth', headers={'x-auth-user': username, 'x-auth-key': apikey})
resp = conn.getresponse()
auth_token = resp.getheader('x-auth-token')
url = resp.getheader('x-storage-url')
conn.close()
# send data
send_headers = {'X-Auth-Token': auth_token, 'Content-Type': 'text/plain'}
container_path = '/'+'/'.join(url.split('/')[3:])+'/api_speed_test2'
conn = httplib.HTTPSConnection(url.split('/')[2])
conn.request('PUT', container_path, headers=send_headers)
conn.getresponse().read()
data_list = ('test_data/%s'%x for x in os.listdir('test_data') if x.endswith('.dat'))
for filename in data_list:
    f = open(filename, 'rb')  # binary mode, so the data is sent byte-for-byte
    conn.request('PUT', container_path+'/'+filename, body=f, headers=send_headers)
    f.close()
    resp = conn.getresponse()
    resp.read()
    if resp.status >= 300:
        print resp.status, resp.reason, container_path+'/'+filename
conn.close()

Although the script is slightly longer, most of the extra code handles the auth. In my tests, uploading 1000 16KB files took about 4.5 minutes. Saving a whole minute (about 18%) on only 1000 objects is a very good result. I would expect the difference to be even greater as the number of files increases.

All of the code above (plus code to generate the test data) can be found in my GitHub account.

By using the ReST API directly, I can make certain assumptions about my data that are not possible in the generic language bindings. I do not need to do the HEAD requests because I know I have just created the container and I have not uploaded the files yet. I am explicitly setting all the data for each object upload. Further improvements would be to add some error handling and parallelization.
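
To give an idea of what the error handling and parallelization might look like, here is a minimal Python 3 sketch (the scripts above are Python 2, so this is not a drop-in change). It assumes the container already exists and that the auth token and storage URL were obtained as in the script above. The names make_uploader and upload_all, the worker count and the retry count are all hypothetical choices for this example, and I have not benchmarked it.

```python
import http.client
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def make_uploader(storage_url, container, auth_token):
    """Return upload(filename) -> HTTP status, with one connection per thread."""
    parts = urlsplit(storage_url)
    base_path = parts.path.rstrip('/') + '/' + container
    headers = {'X-Auth-Token': auth_token, 'Content-Type': 'text/plain'}
    local = threading.local()  # connections must not be shared between threads

    def upload(filename):
        if not hasattr(local, 'conn'):
            local.conn = http.client.HTTPSConnection(parts.netloc)
        with open(filename, 'rb') as f:
            data = f.read()  # small files assumed, so read each one whole
        try:
            local.conn.request('PUT', base_path + '/' + filename,
                               body=data, headers=headers)
            resp = local.conn.getresponse()
            resp.read()  # drain the response so the connection can be reused
        except (OSError, http.client.HTTPException):
            local.conn.close()  # reset, so the next request reconnects
            raise
        return resp.status

    return upload


def upload_all(filenames, upload, workers=8, retries=2):
    """Upload files in parallel; return (filename, problem) for each failure."""
    def attempt(filename):
        problem = None
        for _ in range(retries + 1):
            try:
                status = upload(filename)
            except (OSError, http.client.HTTPException) as err:
                problem = err
                continue
            if status < 300:
                return None  # success
            problem = status
        return (filename, problem)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(attempt, filenames)
        return [r for r in results if r is not None]
```

Usage would be along the lines of building upload with make_uploader(url, 'api_speed_test2', auth_token) and then calling upload_all(data_list, upload). Passing the upload function in as a parameter keeps the retry logic separate from the HTTP details, which also makes it easy to try out with a stand-in function before pointing it at the real service.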

Cloud Files Object Copy

Cloud Files does not currently support object copying. However, a simple workaround is to re-upload the file with the new name. Implementing this workaround may be inconvenient, and one may miss some things like ensuring that metadata is updated. I have added a copy feature to my fork of the python-cloudfiles API that takes care of these details. This is a convenience function only and is not officially supported by Rackspace. Keep in mind that billable bandwidth will be used (unless the servicenet flag is set in the API). One option for renaming large files is to spin up a small Cloud server, use the API to copy over servicenet, and spin down the server. At $0.015 per hour, one could run a 256MB instance for 100 hours before equalling the transfer cost for copying one 5GB (Cloud Files max size) file over the billed network.
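
The break-even figure can be checked with a little arithmetic. The Python 3 sketch below uses only the numbers quoted above. The per-GB figure it derives is simply what those numbers imply, not a published price.

```python
from decimal import Decimal

server_rate = Decimal('0.015')  # dollars per hour for a 256MB instance (from the text)
break_even_hours = 100          # hours quoted in the text
file_size_gb = 5                # Cloud Files maximum object size in GB (from the text)

# Cost of running the server for the quoted break-even period.
server_cost = server_rate * break_even_hours

# Transfer cost per GB implied by the claim that the server cost equals the
# cost of copying one 5GB file over the billed network.
implied_per_gb = server_cost / file_size_gb

print('server cost for the break-even period: $%s' % server_cost)
print('implied cost per GB copied: $%s' % implied_per_gb)
```

Decimal is used instead of floats so that the dollar amounts stay exact.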

My python-cloudfiles fork on GitHub: python-cloudfiles

Example script that copies the last file in a container to another container:

import cloudfiles
conn = cloudfiles.get_connection(username='myname', api_key='mykey')
container_name = 'example_container'
another_container = 'example_container2'
c = conn.get_container(container_name)
l = c.list_objects()
o = c.get_object(l[-1])
new_path = '%s/%s' % (another_container, o.name)
o.copy_to(new_path)
print 'copied', l[-1], 'to', new_path
new_list = conn.get_container(another_container).list_objects()
print new_list
assert o.name in new_list
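
For comparison, the manual re-upload workaround mentioned earlier might look roughly like the sketch below. This is not the copy_to implementation from my fork. It assumes the objects behave like python-cloudfiles objects (read(), write(), content_type and metadata) and that the container offers create_object(); the function name copy_object is made up for this example.

```python
def copy_object(source_obj, dest_container, new_name=None):
    """Copy by re-uploading: read the source object and write it under a new name."""
    new_obj = dest_container.create_object(new_name or source_obj.name)
    # Carry over the details that are easy to forget.
    new_obj.content_type = source_obj.content_type
    new_obj.metadata = dict(source_obj.metadata)
    # This reads the whole object into memory, so it only suits small objects.
    new_obj.write(source_obj.read())
    return new_obj
```

Because the data travels down to the client and back up again, this uses billable bandwidth unless it runs over servicenet, and large objects would need to be streamed rather than read in one go.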