In the last post I went over using the Kaggle CLI to run locally developed files. I prefer to create and edit scripts in my local development environment rather than using the Kaggle web interface (nice as it is). This post builds on the CLI commands introduced there to make it easier to write Kaggle kernels.
The unit of development on Kaggle is the kernel, and a kernel corresponds to one file. For Python development there are three different kinds of kernel:
- A notebook
- A script
- A utility script
The notebook and script are used to produce output, which may be your submission to the competition or something that will be used by another kernel. A utility script is a means of organizing shared code, as it can be imported like a module. My primary focus in this post is scripts and utility scripts, though these techniques work with notebooks too.
Objective
So the aim for this post is simple: to be able to organize and write code, and push it up to Kaggle easily. Make will be used to manage pushing the code to Kaggle. This system isn't perfect, and I'll discuss the imperfections along with some potential solutions.
Why use Make?
I have chosen Make to manage pushing the kernels because it can be used to push only the things that have changed. A makefile is made up of targets, and a target can have dependencies on other targets. For example:
```makefile
run-code : get-dependencies
	python code.py

get-dependencies :
	pip install -r requirements.txt
```
Here the `run-code` target will run the `get-dependencies` target before running `python code.py`. This is a good start, as it ensures that the dependencies are installed; however, we don't want to run the pip installer every time we run the code. So how can we avoid running the pip command if the dependencies are already installed?
Make expects targets to be files and will only run a target if the file it corresponds to is older than its dependencies. This makes it quite easy to determine if the dependencies are out of date - you just have to use a file to track the last time you installed them. An example might help here:
```makefile
run-code : get-dependencies
	python code.py

get-dependencies : .make/dependencies

.make/dependencies : requirements.txt
	mkdir -p .make  # ensure the tracking directory exists
	pip install -r requirements.txt
	touch .make/dependencies
```
Now the modification time of the `.make/dependencies` file is the time that the installer was last run. If the requirements change then the modification time of `requirements.txt` will be more recent than that of `.make/dependencies`, and so the installer will run again.
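In practice a session looks something like this:

```console
$ make run-code   # first run: installs the dependencies, then runs the script
$ make run-code   # nothing has changed, so the pip step is skipped
$ touch requirements.txt
$ make run-code   # requirements.txt is now newer, so pip runs again
```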
This is the core behaviour we can use to determine if we need to push a kernel.
Conditionally Push
So how can we use this to conditionally push a kernel? A kernel is made up of two things: the file to push and metadata about the kernel (in a file called `kernel-metadata.json`). If either of these has changed since we last pushed then we should push again.
So let's imagine we have this folder structure:
```
kernel
- script.py
- kernel-metadata.json
```
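For reference, the `kernel-metadata.json` can be generated with `kaggle kernels init -p kernel` and then edited. It looks roughly like this (the `id` and `title` values here are placeholders for your own):

```json
{
  "id": "your-username/my-script",
  "title": "my-script",
  "code_file": "script.py",
  "language": "python",
  "kernel_type": "script",
  "is_private": "true",
  "enable_gpu": "false",
  "enable_internet": "false",
  "dataset_sources": [],
  "competition_sources": [],
  "kernel_sources": []
}
```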
And we want to create a make target called `push-kernel`. We can start with the unconditional version:
```makefile
push-kernel :
	kaggle kernels push --path kernel
```
This works great. To add the conditional behaviour we can track the last push time with a file, `.make/kernel`:
```makefile
push-kernel : .make/kernel

.make/kernel : kernel/script.py kernel/kernel-metadata.json
	kaggle kernels push --path kernel
	touch .make/kernel
```
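Since `push-kernel` is a command rather than a file we ever intend to create, it's good practice to declare it `.PHONY`, so Make never confuses it with a real file of the same name:

```makefile
.PHONY : push-kernel
```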
So this looks like a good start!
Unpremeditated Pushing
The conditional push we outlined above is great; however, it requires work every time we add a new kernel. What we really want is to be able to just define new kernels and then push them with the make command.
To do this we need to have a consistent layout for the different kernels, and then do some make magic to infer the tracking file. Let’s start.
So if we change our kernel so that it is inside a `kaggle` folder, then we can be sure that every folder inside it corresponds to a kernel that we may want to push. We can imagine having three kernels: script, train and submit. They would then have a layout of:
```
kaggle
- script
  - script.py
  - kernel-metadata.json
- train
  - train.py
  - kernel-metadata.json
- submit
  - submit.py
  - kernel-metadata.json
```
The `script` kernel contains a utility script that is helpful for training. The `train` kernel trains a model on the competition data, using `script` to help with this. It produces a trained model which is then used by `submit` to make the final submission.
With this we can see the relationship between our tracking files and the folders they are tracking. For the `kaggle/script` folder we would want a tracking file of `.make/script`, so we really just want to be able to change the path a little.
If you recall the order of the targets from the conditional push section, we want to target the `.make` file, and then have that depend on the files in the kernel. To achieve this we need to do three things:
- Find the kernels
- Turn those into the associated tracking files
- Make the `push-kernels` target depend on those tracking files
We can do each of these in turn. To get a list of files, we can use the `wildcard` Make function:
```makefile
KERNELS = $(wildcard kaggle/*/kernel-metadata.json)
```
Here we are collecting all of the paths to the `kernel-metadata.json` files in the `kaggle` folder. We match on those files because they are required to push a kernel at all, so they must exist.
The next stage is to transform these paths into the `.make` tracking paths:
```makefile
PUSH_KERNELS = $(KERNELS:kaggle/%/kernel-metadata.json=.make/%)
```
This is a substitution over the entries in the `KERNELS` variable; the syntax is `$(VARIABLE:PATTERN=REPLACEMENT)`. The `%` in the pattern matches anything and remembers what it matched, so we can refer to it again in the replacement.
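Given the three-kernel layout above, the two variables would expand to something like this (the wildcard order may vary):

```makefile
KERNELS      = kaggle/script/kernel-metadata.json kaggle/submit/kernel-metadata.json kaggle/train/kernel-metadata.json
PUSH_KERNELS = .make/script .make/submit .make/train
```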
So now we have a list of the target files. What are they targeting, though?
Once again we can use pattern matching to infer both the files to check and the folder to push:
```makefile
.make/% : kaggle/%/*
	kaggle kernels push --path kaggle/$*
	touch $@
```
This uses three different means of referring to the original target:
- First we capture part of it with `%` and then refer to that in the dependency list, so that `.make/script` depends on `kaggle/script/*`
- Next we use the same captured part in the command to push with `$*`, so that `.make/script` invokes `kaggle kernels push --path kaggle/script`
- Finally we update the tracking file with `$@`, which refers to the current target, so that `.make/script` touches `.make/script`
Complicated huh?
It does work though, and it’s really neat.
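To recap, here is the whole thing assembled into a single Makefile. This is a minimal sketch; I've also added a `mkdir -p .make` so the tracking directory is created if it doesn't already exist:

```makefile
# every folder under kaggle/ with a kernel-metadata.json is a kernel
KERNELS = $(wildcard kaggle/*/kernel-metadata.json)

# map each kernel folder to its tracking file under .make/
PUSH_KERNELS = $(KERNELS:kaggle/%/kernel-metadata.json=.make/%)

# pushing everything just means bringing every tracking file up to date
.PHONY : push-kernels
push-kernels : $(PUSH_KERNELS)

# a kernel is pushed when any file in its folder is newer than its tracking file
.make/% : kaggle/%/*
	mkdir -p .make  # ensure the tracking directory exists
	kaggle kernels push --path kaggle/$*
	touch $@
```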
Downsides
This puts all of your code in folders that are organized by kernel. If you start referring to your utility script from another kernel then your development environment won't be able to resolve the import. I'm hoping to fix this soon.
The other downside is that the kernels have dependencies between them, but that isn't reflected well on Kaggle if you update two of them. If we change the script kernel that train depends on, then pushing the script will not cause train to run again. Worse, it takes around 15 seconds for a script to "complete execution" before you can reliably use the new version from another kernel. These are trickier problems that require some careful work.
I do think that this helps a lot though.