meshBlog

Transferring large Datasets to Swift Object Storage through a CLI

By Christina Kraus, 27 October 2017

Once you have discovered the possibilities of object storage, you may want to migrate your apps and services to it. When migrating a service, you inevitably have to move all of its data into the new storage. While there is a Swift CLI, it runs into limitations with files and folders larger than 10 GB. To avoid running into this limitation, you can write your own upload script based on the Swift CLI. Follow this tutorial to see how to do that.

Setup

First of all, we need the Swift CLI. Since it is written in Python, and messing around with Python versions and libraries is kind of annoying, I prefer to use a tool like Virtualenv for an isolated Python environment. Make sure you have pip installed for the following steps.

pip install virtualenv
virtualenv ENV_DIRECTORY_YOU_WANT_YOUR_ENV_IN
source ENV_DIRECTORY_YOU_WANT_YOUR_ENV_IN/bin/activate

The last three commands set up our Python environment. Now we can install and set up the Swift client. Just run the following commands and you'll be set.

pip install --upgrade setuptools
pip install python-swiftclient
pip install python-keystoneclient

Having installed the CLI, you now have to authenticate against the server. There are different methods to do so; we'll choose the easiest one, configuration through environment variables of the shell. If you want to authenticate via HTTP request, feel free to read the OpenStack documentation. For the meshcloud OpenStack authentication you'll need to use the Keystone V3 API, as V2 is deprecated and not supported by meshcloud.

First of all, you have to get access to your OpenStack credentials. If you are a meshcloud customer, log on to the meshPanel, choose your project and the datacenter you want to store your data in. To access the credentials needed for the Keystone API, click on the last item in the sidebar, called "Service User". Here you can create a service user by typing in a description, choosing "OpenStack" as the platform and hitting the "Plus" button. After creating the service user, a download starts automatically, providing you with everything you need to authenticate. The downloaded file contains a bash script for your operating system. Copy all instructions of that section and paste them into the terminal in which you activated Virtualenv. That's it; the Swift CLI should now be able to authenticate against the server.
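For orientation, the variables such a credentials script exports typically follow the standard OpenStack Keystone V3 scheme; the values below are placeholders, your downloaded file contains the real ones:

export OS_IDENTITY_API_VERSION=3
export OS_AUTH_URL=https://your-keystone-endpoint/v3   # placeholder endpoint
export OS_USERNAME=your-service-user                   # placeholder
export OS_PASSWORD=your-service-user-password          # placeholder
export OS_USER_DOMAIN_NAME=your-user-domain            # placeholder
export OS_PROJECT_NAME=your-project                    # placeholder
export OS_PROJECT_DOMAIN_NAME=your-project-domain      # placeholder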

Before you leave the meshPanel, make sure you have created a Swift bucket. To do so, click on "Objects" in the side panel and enter a name of your choice for your container.
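Alternatively, you can create the container with the Swift CLI itself; the name here is just an example:

swift post YOURBUCKET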

Since this is an upload script, and uploading is a great candidate for parallelisation, we use the GNU tool parallel to gain performance. The installation of parallel is pretty straightforward.

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
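To verify that parallel is available, and to see how it consumes commands from stdin (exactly the way our script will feed it later), a quick smoke test looks like this:

printf 'echo job-1\necho job-2\necho job-3\n' | parallel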

The Script

Now, with a fresh Swift CLI setup, think about what the script needs to do: we have to upload every single file in the directory and all of its subdirectories by starting a Swift client instance for each file. Before writing the script, it is worth playing around a bit to check that everything is working as intended.
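Two quick sanity checks, assuming the credentials from the previous section are still exported in this shell, are listing your containers and inspecting your account:

swift list
swift stat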

To implement the upload, we create a shell script like the following:

#!/bin/bash
# Usage: ./upload.sh SOURCE_DIRECTORY BUCKET_NAME

upload(){
  URI=$1
  BUCKET=$2
  # Collect one upload command per file, skipping dotfiles
  for file in $(find $URI -type f -not -name '.*')
  do
    echo "swift upload $BUCKET $file" >> commands.txt
  done
}
upload $1 $2

# Run all collected upload commands in parallel, then clean up
parallel < commands.txt
rm commands.txt

So let's step through the script to understand what's happening. First of all, we declare a function called upload that takes two parameters: the first is the path of the folder whose contents should be uploaded, and the second is the Swift bucket (e.g. the one you created before) the files should be stored in. Then we iterate over the set of files produced by this neat chunk of code: find $URI -type f -not -name '.*'. find searches through a folder hierarchy; we pass our file path to it and specify with -type f that it should only look for regular files, while -not -name '.*' excludes dotfiles. If you need other filters, feel free to read the man page of find.
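To illustrate: for a hypothetical source folder called data, the find call would print plain file paths, one per line, for example:

data/images/logo.png
data/reports/2017/q3.csv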

In the body of the loop, we write the string "swift upload $BUCKET $file" to a file called commands.txt. After the loop has run through all files, this file contains one command for every single file to be uploaded to the Swift object storage. We collect the commands instead of running them directly so that we can parallelise the work with GNU parallel. This tool lets us keep a constant pool of workers and saves us from hacking something together with the shell's rather poor synchronisation mechanisms.
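With the hypothetical paths from above and a bucket called YOURBUCKET, commands.txt would end up containing lines like these:

swift upload YOURBUCKET data/images/logo.png
swift upload YOURBUCKET data/reports/2017/q3.csv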

After the function definition, we execute the remaining instructions. First, we call the just-declared function with the shell parameters $1 and $2. Then it's time to process all the newly created commands in our .txt file. We do so by piping them into GNU parallel, which runs them in a worker pool of constant size. If you want a specific number of parallel jobs, you can set it with the -j option; for even more control over the jobs, parallel offers plenty of further options, just read through its man page.
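For example, to run 20 uploads at a time instead of the default of one job per CPU core, the corresponding line of the script would become:

parallel -j 20 < commands.txt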

After uploading all the files, the script deletes the command file and exits. Now save the script and make it executable, just like any other shell script.
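Assuming you saved the script as upload.sh, that means:

chmod +x upload.sh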

You can run it now with

./upload.sh YOUR_FILE_PATH YOUR_BUCKET_NAME

CAUTION: Swift has no concept of directories. Many tools therefore encode the complete path in the object name, and the Swift CLI uses the path exactly as it was passed in as the object name. If you need a relative path, make sure to call the script with that exact same path.
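For example, sticking with the hypothetical folder and script names from above:

./upload.sh data YOURBUCKET          # objects named after the relative path, e.g. data/images/logo.png
./upload.sh /home/me/data YOURBUCKET # objects named after the full absolute path instead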

Evaluation of the Upload

When transferring a huge amount of application data of an active service, you want to make sure that every file reached its destination safely. Unfortunately, the total size reported by the object store and the size reported by my HFS+ file system differed so much that it nearly caused a heart attack when I first saw it.

Since size is not a valid criterion, the first and easiest check is to compare the number of files in the bucket with the number of files in the directory. To count the files in the directory and every directory below it, you can use the command from our script with wc piped onto it:

find $URI -type f -not -name '.*' | wc -l

Or, if you want to rely on another command, you can use this one, which lists the directory recursively with ls, strips out directories via grep, removes empty lines via sed and finally counts the remaining lines:

ls -pR /Users/jannikheyl/Downloads/cf.eu-de-darz.msh.host-cc-packages/ | grep -v / | sed '/^$/d' | wc -l

The number of files in the bucket can be found in the meshPanel at the bottom of the bucket overview, or shown on the command line by typing:

swift stat YOURBUCKET
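If you only want the count, the stat output contains an object count that you can grep for, for example:

swift stat YOURBUCKET | grep Objects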

Even though Swift computes MD5 hashes during upload and exposes them as ETags, there is no convenient way to retrieve all hashes of a bucket. If you want to be 100% sure that your data is healthy, download the uploaded data again, put it into a folder and run both folders through this script:

#!/bin/bash
# Usage: ./verify.sh ORIGINAL_DIRECTORY DOWNLOADED_DIRECTORY

origins(){
  URI=$1
  rm -f files.txt hash_origin.txt
  # Collect a hashing command for every file in the original folder
  for file in $(find $URI -type f -not -name '.*')
  do
    if [[ -f $file ]]; then
      echo "md5 -q $file >> hash_origin.txt" >> files.txt
    fi
  done
}

downloaded(){
  URI=$1
  rm -f hash_downloaded.txt
  # Collect a hashing command for every file in the downloaded folder
  for file in $(find $URI -type f -not -name '.*')
  do
    if [[ -f $file ]]; then
      echo "md5 -q $file >> hash_downloaded.txt" >> files.txt
    fi
  done
}

origins $1
downloaded $2

# Hash all files in parallel, then sort both hash lists so they can be compared
parallel -j 20 < files.txt
sort -d hash_origin.txt > hash_origin1.txt
sort -d hash_downloaded.txt > hash_downloaded1.txt

rm files.txt hash_downloaded.txt hash_origin.txt
diff hash_origin1.txt hash_downloaded1.txt

It takes two folders as input: your original folder and the folder downloaded from your Swift storage. It runs an MD5 hash over every file (in parallel), writes those hashes to a file per folder, sorts both files into the same order and prints a diff. If the diff shows no differences, you can be sure that your files are feeling warm and fuzzy in their new home.
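To get the downloaded folder in the first place, the Swift CLI can fetch a whole bucket into a directory of your choice via its --output-dir option. Afterwards, make the script executable and pass both folders to it; all names below are just examples:

swift download YOURBUCKET --output-dir downloaded
chmod +x verify.sh
./verify.sh original-data downloaded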

Troubleshooting

If you run into the warning shown below, run pip install 'requests[security]' to resolve it.

/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/util/ssl_.py:79:
InsecurePlatformWarning: A true SSLContext object is not available.
This prevents urllib3 from configuring SSL appropriately and may cause certain
SSL connections to fail. For more information, see
https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.