<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>some thoughts &#187; CDN</title>
	<atom:link href="http://programmerthoughts.com/tags/cdn/feed/" rel="self" type="application/rss+xml" />
	<link>http://programmerthoughts.com</link>
	<description></description>
	<lastBuildDate>Sun, 25 Jul 2010 16:58:54 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Quickly uploading data to Cloud Files</title>
		<link>http://programmerthoughts.com/programming/quickly-uploading-data-to-cloud-files/</link>
		<comments>http://programmerthoughts.com/programming/quickly-uploading-data-to-cloud-files/#comments</comments>
		<pubDate>Sat, 19 Dec 2009 22:24:24 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Cloud Files]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[CDN]]></category>
		<category><![CDATA[cloud files]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[upload]]></category>

		<guid isPermaLink="false">http://programmerthoughts.com/?p=335</guid>
		<description><![CDATA[A custom file uploader can be more efficient than the generic language bindings provided by Cloud Files. I show how to efficiently upload many files to Cloud Files. The code is available in <a href="http://github.com/notmyname/python_scripts/tree/master/cf_speed/">my github account</a>.]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.rackspacecloud.com/cloud_hosting_products/files">Cloud Files</a> is a great way to store information, either to take advantage of the CDN or to offload the infrastructure requirements of storing large amounts of data. However you use Cloud Files, you must first upload your data to the service.</p>
<p>Uploading is not a problem when the data arrives in small chunks or is spread out over time (images on a blog, for example). The <a href="http://github.com/rackspace">Cloud Files language APIs</a> work well in these cases. Unfortunately, the language bindings can be terribly slow when uploading large numbers of files. While they do make some optimizations (like reusing connections when available), the code is written to be very generic. For example, the bindings make HEAD requests to ensure all proper data is set before allowing you to upload an object. Additionally, at least in the Python language bindings, HEAD requests are issued when an instance of an object is created. These checks are sensible in general, but they become superfluous in a large batch upload. One can achieve much better results by using the Cloud Files ReST API directly.</p>
<p>As an example, let&#8217;s look at the following code which uses the Python API:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/env python</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">os</span>
<span style="color: #ff7700;font-weight:bold;">import</span> cloudfiles
&nbsp;
username = <span style="color: #483d8b;">'xxxx'</span>
apikey = <span style="color: #483d8b;">'xxxx'</span>
&nbsp;
conn = cloudfiles.<span style="color: black;">get_connection</span><span style="color: black;">&#40;</span>username, apikey<span style="color: black;">&#41;</span>
&nbsp;
container = conn.<span style="color: black;">create_container</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'api_speed_test3'</span><span style="color: black;">&#41;</span>
data_list = <span style="color: black;">&#40;</span><span style="color: #483d8b;">'test_data/%s'</span><span style="color: #66cc66;">%</span>x <span style="color: #ff7700;font-weight:bold;">for</span> x <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #dc143c;">os</span>.<span style="color: black;">listdir</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'test_data'</span><span style="color: black;">&#41;</span> \
             <span style="color: #ff7700;font-weight:bold;">if</span> x.<span style="color: black;">endswith</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'.dat'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">for</span> filename <span style="color: #ff7700;font-weight:bold;">in</span> data_list:
    <span style="color: #ff7700;font-weight:bold;">try</span>:
        obj = container.<span style="color: black;">create_object</span><span style="color: black;">&#40;</span>filename<span style="color: black;">&#41;</span>
        obj.<span style="color: black;">load_from_filename</span><span style="color: black;">&#40;</span>filename<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">except</span> cloudfiles.<span style="color: black;">errors</span>.<span style="color: black;">ResponseError</span>, err:
        <span style="color: #ff7700;font-weight:bold;">print</span> err
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>container.<span style="color: black;">list_objects</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>

<p>In my tests, using the above code takes about 5.5 minutes to upload 1000 16KB files to Cloud Files.</p>
<p>I wrote the same functionality using the ReST API directly:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/python</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">os</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">httplib</span>
&nbsp;
username = <span style="color: #483d8b;">'xxxx'</span>
apikey = <span style="color: #483d8b;">'xxxx'</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># auth</span>
conn = <span style="color: #dc143c;">httplib</span>.<span style="color: black;">HTTPSConnection</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'api.mosso.com'</span><span style="color: black;">&#41;</span>
headers = <span style="color: black;">&#123;</span><span style="color: #483d8b;">'x-auth-user'</span>: username, <span style="color: #483d8b;">'x-auth-key'</span>: apikey<span style="color: black;">&#125;</span>
conn.<span style="color: black;">request</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'GET'</span>, <span style="color: #483d8b;">'/auth'</span>, headers=headers<span style="color: black;">&#41;</span>
resp = conn.<span style="color: black;">getresponse</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
auth_token = resp.<span style="color: black;">getheader</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'x-auth-token'</span><span style="color: black;">&#41;</span>
url = resp.<span style="color: black;">getheader</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'x-storage-url'</span><span style="color: black;">&#41;</span>
conn.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
<span style="color: #808080; font-style: italic;"># send data</span>
send_headers = <span style="color: black;">&#123;</span><span style="color: #483d8b;">'X-Auth-Token'</span>: auth_token, <span style="color: #483d8b;">'Content-Type'</span>: <span style="color: #483d8b;">'text/plain'</span><span style="color: black;">&#125;</span>
container_path = <span style="color: #483d8b;">'/'</span>+<span style="color: #483d8b;">'/'</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>url.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>+<span style="color: #483d8b;">'/api_speed_test2'</span>
conn = <span style="color: #dc143c;">httplib</span>.<span style="color: black;">HTTPSConnection</span><span style="color: black;">&#40;</span>url.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
conn.<span style="color: black;">request</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'PUT'</span>, container_path, headers=send_headers<span style="color: black;">&#41;</span>
conn.<span style="color: black;">getresponse</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
data_list = <span style="color: black;">&#40;</span><span style="color: #483d8b;">'test_data/%s'</span><span style="color: #66cc66;">%</span>x <span style="color: #ff7700;font-weight:bold;">for</span> x <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #dc143c;">os</span>.<span style="color: black;">listdir</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'test_data'</span><span style="color: black;">&#41;</span> \
             <span style="color: #ff7700;font-weight:bold;">if</span> x.<span style="color: black;">endswith</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'.dat'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">for</span> filename <span style="color: #ff7700;font-weight:bold;">in</span> data_list:
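    f = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span>filename, <span style="color: #483d8b;">'rb'</span><span style="color: black;">&#41;</span>  <span style="color: #808080; font-style: italic;"># binary mode so the bytes are sent unmodified</span>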
    conn.<span style="color: black;">request</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'PUT'</span>, container_path+<span style="color: #483d8b;">'/'</span>+filename, body=f,
                 headers=send_headers<span style="color: black;">&#41;</span>
    f.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    resp = conn.<span style="color: black;">getresponse</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    resp.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">if</span> resp.<span style="color: black;">status</span> <span style="color: #66cc66;">&gt;</span>= <span style="color: #ff4500;">300</span>:
        <span style="color: #ff7700;font-weight:bold;">print</span> resp.<span style="color: black;">status</span>, resp.<span style="color: black;">reason</span>, container_path+<span style="color: #483d8b;">'/'</span>+filename
conn.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>

<p>Although this version is slightly longer, most of the extra code handles authentication. In my tests, uploading 1000 16KB files took about 4.5 minutes. Saving a full minute (roughly 18%) on only 1000 objects is a very good result. I would expect the difference to be even greater as the number of files increases.</p>
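<p>To put rough numbers on that expectation, here is a minimal sketch that extrapolates from the two timings above. It assumes the per-file cost stays constant as the batch grows, which is an assumption rather than something measured here, and the function names are made up for illustration.</p>

```python
def relative_speedup(slow_minutes=5.5, fast_minutes=4.5):
    """Fraction of the wall-clock time saved by the faster run."""
    return (slow_minutes - fast_minutes) / slow_minutes


def projected_saving_minutes(n_files, slow_minutes=5.5, fast_minutes=4.5,
                             sample_size=1000):
    """Minutes saved for n_files, scaling the measured gap linearly."""
    return (slow_minutes - fast_minutes) * n_files / sample_size


if __name__ == '__main__':
    print('time saved: %.0f%%' % (100 * relative_speedup()))
    print('projected saving for 100000 files: %.0f minutes'
          % projected_saving_minutes(100000))
```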
<p>All of the code above (plus code to generate the test data) can be found in <a href="http://github.com/notmyname/python_scripts/tree/master/cf_speed/">my github account</a>.</p>
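<p>The direct-API script splits the storage URL on slashes by hand to get the host and the path. A slightly more readable sketch of the same step, written for Python 3 with the standard <code>urllib.parse</code> module, might look like this. The helper name and the example URL are placeholders, not real Cloud Files values.</p>

```python
from urllib.parse import urlsplit


def split_storage_url(storage_url, container):
    """Return (host, container_path) for a storage URL and container name."""
    parts = urlsplit(storage_url)
    # netloc is the host to connect to; path is the account prefix.
    container_path = parts.path.rstrip('/') + '/' + container
    return parts.netloc, container_path


if __name__ == '__main__':
    # Placeholder URL for illustration only.
    host, path = split_storage_url('https://storage.example.com/v1/account',
                                   'api_speed_test2')
    print(host, path)
```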
<p>By using the ReST API directly, I can make certain assumptions about my data that are not possible in the generic language bindings. I do not need to do the HEAD requests because I know I have just created the container and I have not uploaded the files yet. I am explicitly setting all the data for each object upload. Further improvements would be to add some error handling and parallelization.</p>
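<p>As a rough idea of what the error handling and parallelization could look like, here is a sketch for Python 3 (<code>http.client</code> replaces <code>httplib</code>). The names <code>make_uploader</code> and <code>upload_all</code> are hypothetical, the worker count is a guess, and the sketch is untested against the live service. Each worker thread keeps its own connection because a single <code>HTTPSConnection</code> cannot be shared safely between threads.</p>

```python
import http.client
import threading
from concurrent.futures import ThreadPoolExecutor


def make_uploader(host, container_path, send_headers):
    """Build an upload function that reuses one connection per thread."""
    local = threading.local()

    def upload_one(filename):
        conn = getattr(local, 'conn', None)
        if conn is None:
            conn = local.conn = http.client.HTTPSConnection(host)
        with open(filename, 'rb') as f:
            body = f.read()
        try:
            conn.request('PUT', container_path + '/' + filename, body=body,
                         headers=send_headers)
            resp = conn.getresponse()
            resp.read()
        except Exception:
            # Drop the broken connection so the next upload starts fresh.
            conn.close()
            local.conn = None
            raise
        if resp.status not in range(200, 300):
            raise IOError('%s %s %s' % (resp.status, resp.reason, filename))

    return upload_one


def upload_all(filenames, upload_one, workers=8):
    """Upload files on a thread pool; return a list of (filename, error)."""
    def attempt(filename):
        try:
            upload_one(filename)
            return None
        except Exception as err:
            # Record the failure instead of aborting the whole batch.
            return (filename, err)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(attempt, filenames))
    return [r for r in results if r is not None]
```

<p>With the host taken from the storage URL and the <code>container_path</code> and <code>send_headers</code> values from the script above, the call would be <code>upload_all(data_list, make_uploader(host, container_path, send_headers))</code>. Any failures it returns can be logged or retried.</p>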
]]></content:encoded>
			<wfw:commentRss>http://programmerthoughts.com/programming/quickly-uploading-data-to-cloud-files/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
