Using Kaggle Kernels with Git

How to use Kaggle Kernels with your regular development setup
Published

March 8, 2021

I’m interested in managing a Kaggle competition submission from the CLI with git. I like Kaggle and want to make more submissions; I just don’t like editing files through the web interface. It would work much better for me if I could manage files in a project and then upload or download them to Kaggle via the CLI.

This post is an exploration of doing exactly that. I use the Kaggle CLI a lot here, and it produces verbose text output. That text is poorly formatted in this post, sorry about that.

The Kaggle CLI is a Python command line client for interacting with Kaggle. It is maintained by Kaggle and appears quite comprehensive, so I’m confident that what I want to do will be supported.
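For completeness, installing and authenticating the CLI looks roughly like this. The CLI reads an API token from ~/.kaggle/kaggle.json, which you download from your account settings page; the ~/Downloads path is just my assumption about where the browser saved it:

```shell
# Install the CLI from PyPI
pip install kaggle

# Place the API token (from "Create New API Token" on the Kaggle
# account page) where the CLI expects it, and restrict permissions
# so the CLI does not warn about a world-readable credentials file
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```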

The first thing that I should do is explore the CLI a bit.

Code
! kaggle
usage: kaggle [-h] [-v] {competitions,c,datasets,d,kernels,k,config} ...
kaggle: error: the following arguments are required: command
Code
! kaggle --help
usage: kaggle [-h] [-v] {competitions,c,datasets,d,kernels,k,config} ...

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

commands:
  {competitions,c,datasets,d,kernels,k,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle competitions
    datasets (d)        Commands related to Kaggle datasets
    kernels (k)         Commands related to Kaggle kernels
    config              Configuration settings

This is encouraging, as it’s easy to explore the kaggle command in this notebook.


Download from a Kernel

I happen to know that competition submissions are managed by kernels, so let’s look at them a bit more.

Code
! kaggle kernels --help
usage: kaggle kernels [-h] {list,init,push,pull,output,status} ...

optional arguments:
  -h, --help            show this help message and exit

commands:
  {list,init,push,pull,output,status}
    list                List available kernels. By default, shows 20 results sorted by hotness
    init                Initialize metadata file for a kernel
    push                Push new code to a kernel and run the kernel
    pull                Pull down code from a kernel
    output              Get data output from the latest kernel run
    status              Display the status of the latest kernel run
Code
! kaggle kernels list
ref                                                      title                                             author               lastRunTime          totalVotes  
-------------------------------------------------------  ------------------------------------------------  -------------------  -------------------  ----------  
ligtfeather/genet-gpu-efficient-network-albumentations   🧠 GENet : GPU Efficient Network + Albumentations  Tanishq Gautam       2021-03-08 10:10:55          20  
ks2019/handy-rl-training-notebook                        Handy RL Training Notebook                        KS                   2021-03-08 16:30:54          13  
tunguz/tps-mar-2021-stacker                              TPS Mar 2021 Stacker                              Bojan Tunguz         2021-03-08 16:52:23           3  
thomaskonstantin/letter-retrieval-molecular-translation  Letter Retrieval | Molecular Translation          Thomas Konstantin    2021-03-08 18:56:35          10  
shikhar07/basic-eda-and-classification-using-knn         Basic EDA and Classification using KNN            Shikhar Sharma       2021-03-08 17:39:14           2  
chandramoulinaidu/spam-classifier-multinomialnb          Spam Classifier(MultinomialNB)                    Chandramouli Naidu   2021-03-08 17:39:48           2  
kugane/classical-algolithm-approach-mtm-lsd-hough        Classical algolithm approach (MTM/LSD/Hough)      nibuiro              2021-03-08 17:06:58           2  
sureshmecad/netflix-moviews-tv-shows                     Netflix moviews & TV Shows                        A SURESH             2021-03-08 16:56:49          21  
khanrahim/fake-news-classification-easiest-99-accuracy   Fake News Classification (Easiest 99% accuracy)   Mohammad Rahim Khan  2021-03-08 07:03:46          15  
fish0731/netflix-data-visualization-analysis             Netflix Data Visualization & Analysis             fish0731             2021-03-08 15:53:14           4  
daidi06/titanic-ensemble-learning                        Titanic Ensemble Learning                         Didier ILBOUDO       2021-03-08 17:41:00           6  
subinium/matplotlib-figure-for-lecture                   [Matplotlib] Figure for Lecture                   Subin An             2021-03-08 09:05:28          35  
heyytanay/in-depth-eda-meta-features-testing             In-depth EDA + Meta-Features + Testing!           Tanay Mehta          2021-03-08 12:30:06           4  
stutisehgal/data-science-diabetes-prediction             Data Science-Diabetes Prediction                  stuti sehgal         2021-03-08 15:42:13           2  
markusdegen/titanic-naive-bayes                          titanic_naive_bayes                               Markus Degen         2021-03-08 19:32:36           9  
ruchi798/mmlm-2021-ncaaw-spread                          MMLM 2021 - NCAAW - Spread                        Ruchi Bhatia         2021-03-08 17:25:19          14  
namanjeetsingh/customerchurn-using-neural-networks       CustomerChurn Using Neural Networks               Namanjeet Singh      2021-03-08 13:05:49           3  
ivangavrilove88/acc-0-999-feature-engineering            ACC = 0.999 (FEATURE ENGINEERING)                 Ivan Gavrilov        2021-03-08 18:09:07           1  
dazzpool/tps-march-2021                                  TPS March 2021 ✨                                  Rohan kumawat        2021-03-08 17:57:31           1  
jaysueno/titanic-comp                                    titanic_comp                                      Jay Sueno            2021-03-08 09:41:54          37  

I wonder if I can turn this into a dataframe for easier viewing.

Code
import re
import pandas as pd

# Capture the table output of the CLI as a list of lines
text = %sx kaggle kernels list

# Columns are separated by runs of two or more spaces
whitespace_pattern = re.compile(r"\s\s+")
headers = whitespace_pattern.split(text[0])

# Skip the header row and the dashed separator line
df = pd.DataFrame([
    dict(zip(headers, whitespace_pattern.split(line)))
    for line in text[2:]
])
df[:3]

So this is nice enough. What I really need in order to work with this is my own kernels.

Code
! kaggle kernels list --help
usage: kaggle kernels list [-h] [-m] [-p PAGE] [--page-size PAGE_SIZE]
                           [-s SEARCH] [-v] [--parent PARENT]
                           [--competition COMPETITION] [--dataset DATASET]
                           [--user USER] [--language LANGUAGE]
                           [--kernel-type KERNEL_TYPE]
                           [--output-type OUTPUT_TYPE] [--sort-by SORT_BY]

optional arguments:
  -h, --help            show this help message and exit
  -m, --mine            Display only my items
  -p PAGE, --page PAGE  Page number for results paging. Page size is 20 by default
  --page-size PAGE_SIZE
                        Number of items to show on a page. Default size is 20, max is 100
  -s SEARCH, --search SEARCH
                        Term(s) to search for
  -v, --csv             Print results in CSV format (if not set print in table format)
  --parent PARENT       Find children of the specified parent kernel
  --competition COMPETITION
                        Find kernels for a given competition slug
  --dataset DATASET     Find kernels for a given dataset slug. Format is {username/dataset-slug}
  --user USER           Find kernels created by a given username
  --language LANGUAGE   Specify the language the kernel is written in. Default is 'all'. Valid options are 'all', 'python', 'r', 'sqlite', and 'julia'
  --kernel-type KERNEL_TYPE
                        Specify the type of kernel. Default is 'all'. Valid options are 'all', 'script', and 'notebook'
  --output-type OUTPUT_TYPE
                        Search for specific kernel output types. Default is 'all'.  Valid options are 'all', 'visualizations', and 'data'
  --sort-by SORT_BY     Sort list results. Default is 'hotness'. Valid options are 'hotness', 'commentCount', 'dateCreated', 'dateRun', 'relevance', 'scoreAscending', 'scoreDescending', 'viewCount', and 'voteCount'. 'relevance' is only applicable if a search term is specified.
Code
! kaggle kernels list --user matthewfranglen --competition hubmap-kidney-segmentation
ref                                     title                   author            lastRunTime          totalVotes  
--------------------------------------  ----------------------  ----------------  -------------------  ----------  
matthewfranglen/256x256x4-images        256x256x4 images        Matthew Franglen  2021-02-11 08:42:18           0  
matthewfranglen/hubmap-fast-ai-starter  HuBMAP fast.ai starter  Matthew Franglen  2021-02-16 20:57:52           0  

Both of these are copies of existing notebooks that have been kindly provided by community members. Let’s see if I can download the notebooks and upload a script associated with this competition.

Code
! kaggle kernels pull --help
usage: kaggle kernels pull [-h] [-p PATH] [-w] [-m] [kernel]

optional arguments:
  -h, --help            show this help message and exit
  kernel                Kernel URL suffix in format <owner>/<kernel-name> (use "kaggle kernels list" to show options)
  -p PATH, --path PATH  Folder where file(s) will be downloaded, defaults to current working directory
  -w, --wp              Download files to current working path
  -m, --metadata        Generate metadata when pulling kernel
Code
! kaggle kernels pull matthewfranglen/256x256x4-images
Source code downloaded to /home/matthew/Programming/Blog/blog/notebooks/256x256x4-images.ipynb

So that worked and there is now a notebook in this directory.


Is the Kernel Private?

The metadata for the kernel should show this.

Code
! kaggle kernels pull --metadata matthewfranglen/256x256x4-images
Source code and metadata downloaded to /home/matthew/Programming/Blog/blog/notebooks

The metadata is:

{
  "id": "matthewfranglen/256x256x4-images",
  "id_no": 14677191,
  "title": "256x256x4 images",
  "code_file": "256x256x4-images.ipynb",
  "language": "python",
  "kernel_type": "notebook",
  "is_private": true,
  "enable_gpu": false,
  "enable_internet": true,
  "keywords": [
    "art"
  ],
  "dataset_sources": [],
  "kernel_sources": [],
  "competition_sources": [
    "hubmap-kidney-segmentation"
  ]
}

So you can see that it is private. This is also a good metadata example to work from when creating a new kernel.


Creating a Kernel

The next thing to check out is creating a kernel associated with the competition and uploading a script.

Code
! kaggle kernels init --help
usage: kaggle kernels init [-h] [-p FOLDER]

optional arguments:
  -h, --help            show this help message and exit
  -p FOLDER, --path FOLDER
                        Folder for upload, containing data files and a special kernel-metadata.json file (https://github.com/Kaggle/kaggle-api/wiki/Kernel-Metadata). Defaults to current working directory

So now I need to come up with some metadata and a script. The script can be very simple, as I just want to establish that the kernel is created and run.

Code
import tempfile
import json
from pathlib import Path

with tempfile.TemporaryDirectory() as directory:
    metadata_file = Path(directory) / "kernel-metadata.json"
    metadata_file.write_text(json.dumps({
        "id": "matthewfranglen/test-kernel",
        "title": "test-kernel",
        "code_file": "test.py",
        "language": "python",
        "kernel_type": "script",
        "is_private": True,
        "enable_gpu": False,
        "enable_internet": False,
    }))
    script_file = Path(directory) / "test.py"
    script_file.write_text("""
from pathlib import Path

Path("/kaggle/working/output.txt").write_text("hello, world!")
""")
    response = %sx kaggle kernels init --path {directory}
    print(response)
    print(metadata_file.read_text())
['Kernel metadata template written to: /tmp/tmpwnwsn320/kernel-metadata.json']
{
  "id": "matthewfranglen/INSERT_KERNEL_SLUG_HERE",
  "title": "INSERT_TITLE_HERE",
  "code_file": "INSERT_CODE_FILE_PATH_HERE",
  "language": "Pick one of: {python,r,rmarkdown}",
  "kernel_type": "Pick one of: {script,notebook}",
  "is_private": "true",
  "enable_gpu": "false",
  "enable_internet": "true",
  "dataset_sources": [],
  "competition_sources": [],
  "kernel_sources": []
}

This did not create the kernel.

It looks like the kaggle kernels init command just creates a sample metadata file. I also don’t see a way to associate the kernel with the competition through it, so I’m going to add the competition slug to competition_sources instead.

Code
with tempfile.TemporaryDirectory() as directory:
    metadata_file = Path(directory) / "kernel-metadata.json"
    metadata_file.write_text(json.dumps({
        "id": "matthewfranglen/test-kernel",
        "title": "test-kernel",
        "code_file": "test.py",
        "language": "python",
        "kernel_type": "script",
        "is_private": True,
        "enable_gpu": False,
        "enable_internet": False,
        "competition_sources": ["hubmap-kidney-segmentation"],
    }))
    script_file = Path(directory) / "test.py"
    script_file.write_text("""
from pathlib import Path

Path("/kaggle/working/output.txt").write_text("hello, world!")
""")
    response = %sx kaggle kernels push --path {directory}
    print(response)
['Kernel version 1 successfully pushed.  Please check progress at https://www.kaggle.com/matthewfranglen/test-kernel']

This time it worked! The output file was created and contained the expected text.

Code
! kaggle kernels list --user matthewfranglen --competition hubmap-kidney-segmentation
ref                                     title                   author            lastRunTime          totalVotes  
--------------------------------------  ----------------------  ----------------  -------------------  ----------  
matthewfranglen/256x256x4-images        256x256x4 images        Matthew Franglen  2021-03-09 02:02:06           0  
matthewfranglen/test-kernel             test-kernel             Matthew Franglen  2021-03-09 02:11:25           0  
matthewfranglen/hubmap-fast-ai-starter  HuBMAP fast.ai starter  Matthew Franglen  2021-02-16 20:57:52           0  

It has also been associated with the competition. That’s neat.

Let’s try changing it.

Code
with tempfile.TemporaryDirectory() as directory:
    metadata_file = Path(directory) / "kernel-metadata.json"
    metadata_file.write_text(json.dumps({
        "id": "matthewfranglen/test-kernel",
        "title": "test-kernel",
        "code_file": "test.py",
        "language": "python",
        "kernel_type": "script",
        "is_private": True,
        "enable_gpu": False,
        "enable_internet": False,
        "competition_sources": ["hubmap-kidney-segmentation"],
    }))
    script_file = Path(directory) / "test.py"
    script_file.write_text("""
from pathlib import Path

Path("/kaggle/working/output.txt").write_text("goodbye cruel world!")
""")
    response = %sx kaggle kernels push --path {directory}
    print(response)
['Kernel version 2 successfully pushed.  Please check progress at https://www.kaggle.com/matthewfranglen/test-kernel']

Once again it has worked!

This actually seems quite straightforward.

As far as integrating this with git goes, I don’t think it’s possible. The Kaggle API can’t reasonably be added as a git remote, as it behaves quite differently. While a push hook could be used to keep the Kaggle kernel up to date, I think I would rather use a make target to update it manually.
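As a sketch, such a make target might look like this. The kernel/ directory holding kernel-metadata.json and the code file is my assumption about project layout, not something the CLI requires:

```makefile
# Push the local kernel files to Kaggle and check on the run.
# Assumes kernel-metadata.json and the code file live in kernel/.
.PHONY: push status

push:
	kaggle kernels push --path kernel

status:
	kaggle kernels status matthewfranglen/test-kernel
```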

If only there were a way to handle multiple files in a kernel, then working with Kaggle would integrate much more tightly with my existing workflow. Perhaps I can create an auto script smasher that puts all of the different files into one? Something for another day, I think.
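As a rough sketch of what such a script smasher might look like, this hypothetical helper just concatenates a list of Python files into one, with a comment marking where each file starts. A real version would need to handle imports between the files, which this ignores:

```python
from pathlib import Path


def smash(sources: list[Path], target: Path) -> None:
    """Concatenate several Python source files into one script.

    Each file's contents are preceded by a comment naming the
    original file, so the combined script is still navigable.
    """
    parts = []
    for source in sources:
        parts.append(f"# --- {source.name} ---")
        parts.append(source.read_text())
    target.write_text("\n".join(parts))
```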