Top Tips and Scripts You Need to Supercharge Google Colab
Author: Nathan Peper
I've been playing in Google's Colaboratory (Colab) for many years now, and while it's my favorite freemium platform for Data Science, AI, and ML development, I'll be the first to admit that I'm always trying to find ways to squeeze more out of it. Hopefully, these tips and scripts help you get even more value from the platform once you hit the point where you start to feel constrained, even by Colab's outstanding capabilities.
Colab Overview
Google's Colab is a free Jupyter notebook service that requires no setup, runs in Google's Cloud, and all you need to get started is a Google Account. It allows you to write, execute, save, and share your Jupyter notebooks, while also giving you seamless access to powerful computing resources such as CPUs, GPUs, TPUs, and increased RAM. It comes preinstalled with a number of popular Python libraries, such as TensorFlow, Keras, PyTorch, scikit-learn, NumPy, pandas, etc. Notebooks are seamlessly stored in a Colab folder in Google Drive, but you can also open notebooks from, and commit back to, GitHub, and sharing is easy through the use of URLs.
While access to all of these amazing resources is free, Google also offers paid tiers that provide an increased usage allowance (metered in compute units) for the hardware resources, more powerful GPUs, higher memory, longer session runtimes, and terminal access. As these resources are evolving, you can find more details on their Signup/FAQ page.
Getting Started
Getting started and taking a sneak peek at the platform is as easy as going to the following link: https://colab.research.google.com/.
If you don't have a Google account yet, you can sign up from this page as well.
Configure the Colab Environment
As previously mentioned, you can configure which hardware resources your Colab environment has access to.
Using the GUI, navigate to "Runtime -> Change runtime type" to select an accelerator and, if required, increase the available RAM. After making these changes, you can confirm you have the desired hardware before spending any time configuring the rest of the environment by running the following cell in your notebook.
# List the GPU attached to this runtime and confirm it's one you're happy with
gpu = !nvidia-smi -L
print(gpu[0])
assert any(x in gpu[0] for x in ['P100', 'V100'])  # adjust the list to the accelerators you'll accept
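Similarly, if you've selected a high-RAM runtime, a quick sanity check confirms how much memory you actually got. A minimal sketch, assuming the psutil package is available (it's preinstalled in Colab):
from psutil import virtual_memory

# Report total memory available to this runtime in gigabytes
ram_gb = virtual_memory().total / 1e9
print(f'Your runtime has {ram_gb:.1f} GB of RAM')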
Additionally, once your environment is configured you can use the following script to see which packages and versions are already installed:
from pkg_resources import working_set

libs = [x.project_name.lower() + ' ' + x.version for x in working_set]
for lib in sorted(libs):
    print(lib)
Reviewing these packages is a great habit to get into when you're pulling example notebooks from other places, so that you aren't surprised when the existing code doesn't work as advertised. Many packages maintain backward compatibility, but you'll also run into cases where the notebook you're pulling in targets a more recent version with new features while the Colab environment has an older, more stable version installed by default.
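If an imported notebook depends on particular versions, a quick check up front saves debugging later. Here's a minimal sketch; the required dictionary below is just a hypothetical example, so swap in the packages and minimum versions your notebook actually needs:
from importlib.metadata import version, PackageNotFoundError

# Hypothetical example: packages and the minimum versions this notebook expects
required = {'tensorflow': '2.8.0', 'pandas': '1.3.0'}

for name, minimum in required.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f'{name} is NOT installed')
        continue
    print(f'{name}: installed {installed} (expected >= {minimum})')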
As you configure your environments, you'll become more familiar with the various ways to install and/or import packages to build the required environment and run your analysis. This can become a time-consuming process and isn't always needed, such as when you are only restarting the runtime (kernel).
To keep your notebooks concise and avoid unnecessary waiting, try creating a dummy file and testing for its presence during your pip install steps:
![ ! -f "pip_installed" ] && pip -q install tensorflow-datasets==4.4.0 tensorflow-addons && touch pip_installed
Similarly, you can use the same technique to save time by avoiding copying and extracting large data files each time you rerun your notebook:
![ ! -d "my_data" ] && unzip /content/drive/MyDrive/my_data.zip -d my_data
Connect to your data, files, and storage (if needed)
Colab has access to the internet, so you can pull in datasets and files or connect to remote storage systems. You can also move data between Colab and your local machine by uploading and downloading through the user interface or with simple scripts.
Upload from Local
from google.colab import files
uploaded_file = files.upload()
Download to Local
from google.colab import files
files.download("/path/to/file.ext")
Google Ecosystem Integrations
Colab is provided by Google, so they've made access to their suite of products as easy as possible. Most commonly used is the ability to mount your Google Drive to the Colab system.
from google.colab import drive
drive.mount('/content/drive')
However, Colab does not treat all notebooks equally! If your workflow is to upload a Jupyter notebook to GDrive, open it in Colab, and then access your files and data using the script above, you'll have to go through the entire authentication process, which wastes a lot of time and gets old quickly with repeated use. However, if you create a notebook from scratch in GDrive, Colab will automatically mount GDrive to the Colab VM as soon as you open the notebook, without the lengthy authentication process!
To do this, create a new Colab notebook from GDrive, then open the existing Jupyter notebook with Colab in another browser tab. You can either upload your existing notebook to GDrive and double-click it to open with Colab, or simply go to Colab and click on the upload tab. Uploaded notebooks will reside at GDrive://Colab Notebooks/.
Now clear all outputs of the existing notebook and, from Command mode, select all cells by pressing <SHIFT> + <CMD|CTRL> + A, then copy them by pressing <CMD|CTRL> + C. Go to the new Colab notebook and paste everything by pressing <CMD|CTRL> + V. Now you can select the Mount Drive icon under the Files menu, and you will connect to Drive without the lengthy authentication process. Feel free to delete the old Jupyter notebook to avoid confusion in the future.
Beyond data, if you're someone who has created an amazing helper.py, or your own private Python package, to help speed up your workflow, put them all on your GDrive in a folder called /packages and then add this to your notebook:
import sys
sys.path.append('/content/drive/MyDrive/packages/')
from helper import *
Now there are a few extra considerations to call out when using Colab with Google's storage systems, such as GDrive and Cloud Storage buckets. If you want to use Google's TPUs to train your models, your data will have to be stored in a Google Cloud Storage bucket. You can still conduct your preprocessing steps within the Colab environment and then use the preinstalled gsutil utility to transfer the required files prior to initiating each TPU training run with something like this:
!gsutil -m cp -r /root/tensorflow_datasets/my_train_ds/ gs://my_bucket/
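If the bucket isn't public, you'll typically need to authenticate the Colab session with your Google Cloud account before gsutil can write to it. A minimal sketch, run before the gsutil command above:
from google.colab import auth

# Opens the standard Colab authorization flow; gsutil and other Cloud tools
# in this session can then act with your credentials.
auth.authenticate_user()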
Most users of Colab have been burned by Google terminating their session due to inactivity. While it doesn't always make sense to store every model checkpoint during training to GDrive, you'll want to ensure that your final model and data are completely transferred. One way to help ensure the complete transfer of all files to GDrive is to call drive.flush_and_unmount() at the very end of the notebook.
from google.colab import drive
model.fit(...) # I've got better things to do than watch the progress bar...
model.save('/content/drive/MyDrive/...')
drive.flush_and_unmount()
The completion of the model.save('/content/drive/MyDrive/...') step does not ensure that all files are safely on GDrive and that you can terminate the Colab instance, since the data transfer is done asynchronously. Performing the flushing step helps ensure that the transfer is complete and that it is indeed safe to disconnect.
Finally, lots of people are looking for the best possible instances to work on their own projects. Google notices, and tends to reward responsible, considerate users who free up their accelerator instances as soon as training is done with access to better GPUs. So once you've completed your training and saved your model files, add the following cell at the end of your notebook:
from google.colab import runtime
runtime.unassign()
There are also additional ways to use tools such as PyDrive, the Drive REST API, Google Sheets, and Google Cloud Storage. You can see all of the details and some snippets in this example notebook. It is definitely worth a quick read to get familiar with all of the seamless options at your disposal.
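As one example, here's a minimal sketch of reading a Google Sheet with the preinstalled gspread library; the sheet name 'My Spreadsheet' is just a placeholder for one of your own:
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default

# Authorize gspread with the credentials from the Colab auth flow
creds, _ = default()
gc = gspread.authorize(creds)

# 'My Spreadsheet' is a placeholder; use the name of a sheet in your Drive
worksheet = gc.open('My Spreadsheet').sheet1
rows = worksheet.get_all_values()
print(rows[:5])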
GitHub Integration
There are many dated articles on workarounds for pulling from and pushing to GitHub repositories from Colab, but rest assured that you no longer have to worry about this. See this link; when the instructions say to "Save a copy to GitHub," you're really committing and pushing your changes back to the original file with a commit message.
AWS Scripts
Amazon's Simple Storage Service (S3) is an AWS storage solution that provides the ability to store and retrieve any amount of data from anywhere, depending on your internet connection of course. In this solution, data is stored as objects within buckets and there are additional capabilities for users such as in-place query and big data analytics.
Assuming you already have an AWS account, install the latest AWS SDK called Boto3.
!pip install boto3
Once this is installed for the Colab session, you can run a cell to define the S3 resource you want to interact with. For simplicity, you can enter your key information directly in the notebook, but I'd recommend getting into the habit of saving your keys in a separate file, adding that file name to your .gitignore, and importing the values at runtime so they aren't stored in a notebook you might share or make publicly visible (see the sketch after the next cell).
import boto3
import botocore  # needed for the error handling below

BUCKET_NAME = 'xxxxxxxxxx'  # replace with your S3 bucket name

# enter authentication credentials
s3 = boto3.resource('s3',
                    aws_access_key_id='ENTER YOUR ACCESS KEY',
                    aws_secret_access_key='ENTER YOUR SECRET KEY')
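As recommended above, here's a minimal sketch of keeping the keys out of the notebook itself. The file name aws_keys.json and its fields are hypothetical, so adapt them to however you store your credentials (and remember to add the file to .gitignore):
import json
import boto3

# Hypothetical file kept outside version control, e.g. on your mounted GDrive:
# {"aws_access_key_id": "...", "aws_secret_access_key": "..."}
with open('/content/drive/MyDrive/aws_keys.json') as f:
    keys = json.load(f)

s3 = boto3.resource('s3',
                    aws_access_key_id=keys['aws_access_key_id'],
                    aws_secret_access_key=keys['aws_secret_access_key'])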
Now you can define the Key with the filename you want to pull from your bucket, rename it for local use, and catch any errors with a try-except.
KEY = 'filetodownload.csv'  # replace with your object key

try:
    # we are trying to download the training dataset from S3 with name `filetodownload.csv`
    # to the Colab dir with name `training.csv`
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'training.csv')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
Now the file is ready for use in the current Colab session and you can read it with your favorite data packages using something like pandas.read_csv() or newer and more scalable versions of our old favorites like modin.read_csv().
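For example, a minimal sketch with pandas (which comes preinstalled):
import pandas as pd

# Read the file downloaded from S3 in the previous cell
train_df = pd.read_csv('training.csv')
train_df.head()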
Keep Colab Session Alive
We've all been there. You get familiar with Google Colab, learn how to set up your environment, get all of your data sources connected and packages installed, you know the problem you want to solve, you've done your EDA, you've selected the right features, and you're ready to train your model. One last <SHIFT> + <ENTER> and your training loop is off to the races. You've watched a few passes over the training data, nothing has broken, you haven't run out of memory, but now you're getting bored. You walk away to do something else, come back to check on everything, and your session has been disconnected...
While there is no official way to keep the session alive, most of the recommended solutions you'll find online involve writing a bit of JavaScript and entering it into the browser console.
function ClickConnect(){
  console.log("Working");
  document.querySelector("colab-connect-button").shadowRoot.getElementById('connect').click();
}
setInterval(ClickConnect, 60000)
Connect Local Webcam
Reading saved data from various storage services is great for many use cases, but at some point you might want to start learning about and playing with computer vision, using a webcam to capture yourself or your surroundings in near real time. The Colab team has a great script ready for you in their advanced outputs notebook.
from IPython.display import display, Javascript
from google.colab.output import eval_js
from base64 import b64decode
def take_photo(filename='photo.jpg', quality=0.8):
  js = Javascript('''
    async function takePhoto(quality) {
      const div = document.createElement('div');
      const capture = document.createElement('button');
      capture.textContent = 'Capture';
      div.appendChild(capture);
      const video = document.createElement('video');
      video.style.display = 'block';
      const stream = await navigator.mediaDevices.getUserMedia({video: true});
      document.body.appendChild(div);
      div.appendChild(video);
      video.srcObject = stream;
      await video.play();
      // Resize the output to fit the video element.
      google.colab.output.setIframeHeight(document.documentElement.scrollHeight, true);
      // Wait for Capture to be clicked.
      await new Promise((resolve) => capture.onclick = resolve);
      const canvas = document.createElement('canvas');
      canvas.width = video.videoWidth;
      canvas.height = video.videoHeight;
      canvas.getContext('2d').drawImage(video, 0, 0);
      stream.getVideoTracks()[0].stop();
      div.remove();
      return canvas.toDataURL('image/jpeg', quality);
    }
    ''')
  display(js)
  data = eval_js('takePhoto({})'.format(quality))
  binary = b64decode(data.split(',')[1])
  with open(filename, 'wb') as f:
    f.write(binary)
  return filename
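A minimal sketch of calling it, assuming the cell above has been run; the browser will prompt for camera access, and the call fails gracefully if you deny it or have no webcam:
from IPython.display import Image

try:
  filename = take_photo()
  print('Saved to {}'.format(filename))
  # Show the image that was just captured.
  display(Image(filename))
except Exception as err:
  # Errors are thrown if the user does not grant camera access
  # or if the runtime has no webcam attached.
  print(str(err))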
Summary
I hope these tips and scripts help speed up your learning and your ability to squeeze the most out of your experience with Google Colab. Let me know if you have any other favorite tips or ways that you get the most out of this great platform!