The purpose of this tutorial is to show you how to use the Clusterone platform end-to-end. Specifically, we’ll go over how to use Clusterone to:
- Upload data
- Upload code
- Link your data to your code
- Test locally
- Launch a job
- View job progress in TensorBoard
- Download job logs
Download the Data
For this tutorial, we’ll be using the Tiny ImageNet dataset, as it’s small enough to be able to iterate quickly while simultaneously providing enough complexity to demonstrate the full capability of the platform. Tiny ImageNet is similar to the classic ImageNet (ILSVRC) you’re probably familiar with but much smaller since it was initially developed for students to experiment with for Stanford’s CS231 class. Tiny ImageNet has 200 classes, with each class containing 500 training images, 50 validation images, and 50 test images. Each image is 64 by 64 pixels. Go ahead and download the dataset from Kaggle or the CS231N course.
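Once you've extracted the archive, a short script can sanity-check the layout before you upload anything. This is a minimal sketch: it assumes the standard Tiny ImageNet directory structure (train/&lt;wnid&gt;/images/*.JPEG), and the expected counts are the numbers quoted above.

```python
import os

def check_tiny_imagenet(root):
    """Count classes and training images in an extracted Tiny ImageNet tree.

    Assumes the standard layout: root/train/<wnid>/images/*.JPEG
    """
    train_dir = os.path.join(root, "train")
    classes = sorted(
        d for d in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, d))
    )
    counts = {
        c: len(os.listdir(os.path.join(train_dir, c, "images")))
        for c in classes
    }
    return len(classes), counts

# For the full dataset you'd expect 200 classes with 500 training images each:
# n_classes, counts = check_tiny_imagenet("tiny-imagenet-200")
```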
Upload the Data
With the dataset in hand, let’s go ahead and upload it to Clusterone. We will create the dataset on Clusterone using the command line interface (CLI).
- Make sure you have downloaded and configured the CLI.
- Make sure you have input your AWS IAM keys by opening the Clusterone Matrix, logging into your account, clicking on the Clusterone icon in the top right of the screen, and selecting Access Keys from the menu.
- It is highly recommended to use virtual environments for all Clusterone projects.
Before we can upload data to Clusterone, we have to create the dataset. There are several ways to do so, but we're going to take the S3 bucket approach. Note that S3 bucket names must be unique. Using the CLI, I'm going to create the bucket "clusterone-tiny-imagenet-example":
$ just create dataset s3 clusterone-tiny-imagenet-example
You can confirm the dataset was created by checking that it appears in the table generated by:
$ just get datasets
At this point, we have created a dataset object on Clusterone but the object is empty. In this case, we can populate the dataset using the AWS S3 CLI.
- Make sure you have downloaded and configured the AWS CLI as described here.
Assuming I downloaded and extracted the Tiny ImageNet dataset to my current directory, the following commands can be used to upload the Tiny ImageNet content to our newly created dataset:
$ curl http://cs231n.stanford.edu/tiny-imagenet-200.zip --output tiny-imagenet-200.zip
$ unzip tiny-imagenet-200.zip
$ aws s3 cp tiny-imagenet-200 s3://clusterone-tiny-imagenet-example --recursive
The upload process can take a few minutes so go grab a cup of coffee.
Let’s confirm the S3 bucket we created has been populated using:
$ aws s3 ls s3://clusterone-tiny-imagenet-example
After the upload process is complete, any jobs you run on Clusterone will be able to utilize the dataset.
Test Code Locally and Upload
We’re going to train a small CNN on the Tiny ImageNet data using PyTorch. For this tutorial, we’re only going to be using a single machine for training and not a distributed environment. Clone the code we’ll be using from GitHub:
$ git clone https://github.com/clusterone/examples.git
Clusterone provides a few utility functions to make the transition from local testing of your code to the Clusterone platform seamless. Specifically, these are the get_logs_path and get_data_path functions, which you can see imported at the very beginning of the train_and_eval.py script.
from clusterone import get_logs_path, get_data_path
These functions detect whether the code is being executed locally or on the Clusterone platform and adjust the data path accordingly. Read up on the documentation for these functions for the details. Once configured, your code will set TRAIN_DATA_DIR to '~/Documents/Scratch/tiny_imagenet/tiny-imagenet-200/train' when run locally and to '/data/USERNAME/clusterone-tiny-imagenet-example/train' when run on the Clusterone platform.
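Conceptually, the helper works like the sketch below. This is an illustration of the idea only, not the clusterone library's actual implementation; the environment-variable check is a hypothetical stand-in for however the real library detects that it is running on the platform, so use the real get_data_path in your code.

```python
import os

def resolve_data_path(dataset_name, local_path):
    """Illustrative sketch of what a helper like get_data_path does.

    The CLUSTERONE_USER environment variable below is hypothetical; it
    stands in for however the real library detects the platform.
    """
    platform_user = os.environ.get("CLUSTERONE_USER")  # hypothetical detection
    if platform_user is not None:
        # On the platform, datasets are mounted under /data/USERNAME/...
        return os.path.join("/data", platform_user, dataset_name)
    # Locally, fall back to a directory on your own machine.
    return os.path.expanduser(local_path)
```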
Next, we’ll make sure the code runs locally before putting it up on the platform. Clusterone provides the command line utility “just run local” for this purpose. While it’s much more useful for testing distributed jobs, “just run local” can be used to test single node jobs as well. For this tutorial, go ahead and test the code locally using:
$ just run local single --command "python -m tiny_imagenet_pytorch.train_and_eval" --env current
Any errors that come up during “just run local” will also arise on the platform.
- Always test your code using “just run local” before running your code on the platform. It will save you a lot of time.
In this “just run local” example, we’ve specified that we’ll be using the current (activated) virtual environment. In this environment, we’ve installed everything in the requirements.txt file. If you open up the requirements.txt file you should find the following contents:
clusterone
pillow
tensorboard_logger
Notice that even though we are using the torch and torchvision libraries, they are not included in the requirements file. This is because when we run a job on Clusterone, it runs inside a highly optimized Docker image with all PyTorch libraries pre-installed. Including the PyTorch libraries in the requirements file would override the pre-installed versions in the optimized image we will choose in a bit, which could lead to unexpected behavior.
- Never include libraries pertaining to your deep learning environment in your requirements file.
Assuming "just run local" completes without errors, let's go ahead and run our code on Clusterone.
- Make sure you have linked your GitHub account to Clusterone as described here.
Go ahead and add any changes you’ve made to the cloned GitHub code and then push them to a repository in your personal GitHub account. At this point, for this example Clusterone has access to all the Tiny ImageNet data we’ll need but it is not aware of the code needed to train the model on the data. We need to link Clusterone to the GitHub repository to which you pushed your code. To do so, we’re going to go ahead and create a Clusterone project in Matrix, the GUI for Clusterone.
Log in to Matrix, click on the Projects tab on the left-hand side, and then click on the “Add New Project” link on the right-hand side of the screen. Then, click on “Link GitHub Repository” and start typing the name of the repository to which you just pushed your code, e.g. username/examples (if it was forked from clusterone/examples). Clusterone now has access to the code necessary to train our Tiny ImageNet example.
You can confirm Clusterone has correctly created your project using the CLI if it appears in the table generated by:
$ just get projects
You can also go to Matrix, click on the Projects tab, and then go to the Repository tab:
Launch a Job
We’re now going to use the CLI to launch a job on Clusterone to train the small CNN on Tiny ImageNet. We’re not running a distributed job for this example. To see all the options available for creating a single node job on Clusterone, type the following in your terminal.
$ just run job single --help
For the specific job we'd like to run, the CLI input would look something like:
$ just run job single \
    --name tiny_imagenet_train_gpu \
    --datasets USERNAME/clusterone-tiny-imagenet-example \
    --command "python -m tiny_imagenet_pytorch.train_and_eval" \
    --num_epochs 30 \
    --project GITHUB_USERNAME/examples \
    --instance-type p3.2xlarge \
    --time-limit 4h \
    --docker-image pytorch-0.4.0-gpu-py36-cuda9.2 \
    --setup-command "pip install -r tiny_imagenet_pytorch/requirements.txt"
Of course, go ahead and modify the command to suit your use case. In this case, we're going to be training on a p3.2xlarge GPU instance for 30 epochs. Notice that we're using a PyTorch Docker image with CUDA 9.2 pre-installed; the Docker image is optimized for the versions in its name, namely CUDA 9.2 and PyTorch 0.4.0. This is why it's best not to include PyTorch in the requirements file. Note that the input for "--datasets" is the name of the dataset we created earlier to store the Tiny ImageNet data. If you forgot the name, you can retrieve it from the table output by "just get datasets".
Matrix view provides a lot of useful information about the job we’re running. In Matrix you should see something similar to the following for your job:
Note that the job can take a few minutes until it reaches a “Running” state. The platform will stay in “Pending” mode while it acquires resources and copies data and code to wherever they belong. You can see exactly what the job is working on in the “Events” tab under the job. You can also verify the information you used to create the job in the “Information” tab. We encourage you to look through all the tabs to get familiarized with the platform.
Once the job starts training, you'll see a log.txt file and a TensorBoard events file under the “Outputs” tab in Matrix. Naturally, an events file will only appear if your code generates one. In this example, we use the tensorboard_logger library, which is included in requirements.txt, to make PyTorch work with TensorBoard. If an events file is present in the “Outputs” tab, we can view training progress live by sliding the “In TensorBoard” toggle and then clicking the TensorBoard tab in the top right corner of Matrix. For this example, TensorBoard should display something like this:
You can download individual files from Matrix. If you save model checkpoints, you can also download those, so long as you save your checkpoints in the log directory returned by get_logs_path. If you want to download everything under the “Outputs” tab, you can use the CLI as follows:
$ just download job c5b1840d-d6a6-4cca-9d46-721f5e3168ab
The identifier above is the job id, which you can obtain using “just get jobs” in the CLI.
Thanks to the CLI, it's very simple to launch multiple jobs with various configurations. All you have to do is make your code accept command line arguments. This makes Clusterone ideal for tasks like hyperparameter tuning.
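Accepting command line arguments takes only a few lines with argparse. A minimal sketch; the hyperparameter names here are just examples, not flags the example repository necessarily defines:

```python
import argparse

def parse_args(argv=None):
    """Parse training hyperparameters from the command line."""
    parser = argparse.ArgumentParser(description="Tiny ImageNet training")
    parser.add_argument("--learning-rate", type=float, default=0.001,
                        help="optimizer learning rate")
    parser.add_argument("--batch-size", type=int, default=128,
                        help="minibatch size")
    parser.add_argument("--num-epochs", type=int, default=30,
                        help="number of training epochs")
    return parser.parse_args(argv)
```

Each "just run job" invocation can then pass a different configuration through its --command string, e.g. --command "python -m tiny_imagenet_pytorch.train_and_eval --learning-rate 0.01".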
In this example, we trained a simple CNN on a single GPU node. Clusterone's real power is in distributed learning. To launch a distributed job on Clusterone, the only changes required are to the configuration of "just run job". Of course, you have to make sure your code is geared for distributed training. We leave this as a challenge for you. Happy distributed learning!
You can sign up and start using Clusterone now: