Available on: Enterprise Edition >= 0.18.0

Run tasks as containers on Google Cloud VMs.

How to use the Google Batch task runner

This task runner deploys the task's container to a specified Google Cloud Batch VM.

To launch the task on Google Cloud Batch, there are three main concepts you need to be aware of:

  1. Machine Type — a mandatory property indicating the compute machine type to which the task will be deployed. If no reservation is specified, a new compute instance will be created for each batch job, which can add around a minute of latency.
  2. Reservation — an optional property allowing you to set up a reservation for a given virtual machine and avoid the time needed to create a new compute instance for each task.
  3. Network Interfaces — an optional property; if not specified, the default network interface is used (see the sketch below).
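
A minimal configuration using all three properties might look like the following sketch. The values are placeholders, and the exact shape of the networkInterfaces property is an assumption based on the GCP Batch API, so check the plugin reference for your Kestra version:

```yaml
taskRunner:
  type: io.kestra.plugin.ee.gcp.runner.Batch
  projectId: my-gcp-project            # placeholder project ID
  region: europe-west9
  bucket: my-gcs-bucket                # placeholder bucket used for file exchange
  machineType: e2-medium               # mandatory: Compute Engine machine type
  reservation: my-reservation          # optional: reuse a pre-created VM reservation
  networkInterfaces:                   # optional: defaults to the default interface
    - network: global/networks/default
      subnetwork: regions/europe-west9/subnetworks/default
```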

How does the Google Batch task runner work

In order to support inputFiles, namespaceFiles, and outputFiles, the Google Batch task runner will do the following:

  • mount a volume from a GCS bucket
  • upload input files to the bucket before launching the container
  • download output files from the bucket after the container has finished running.

Since we don't know the working directory of the container in advance, you need to explicitly define the working directory and output directory when using the GCP Batch runner, e.g. use python {{ workingDir }}/main.py rather than python main.py.
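
For example, you can anchor both the script path and its output explicitly (a minimal sketch; {{ outputDir }} is Kestra's expression for a directory whose files are persisted as outputs, used here as an assumption, so verify it against the scripts plugin documentation):

```yaml
commands:
  # Anchor paths explicitly instead of relying on the container's
  # current working directory:
  - python {{ workingDir }}/main.py > {{ outputDir }}/result.txt
```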

A full flow example

```yaml
id: gcp_batch_runner
namespace: company.team

variables:
  region: europe-west9

tasks:
  - id: scrape_environment_info
    type: io.kestra.plugin.scripts.python.Commands
    containerImage: ghcr.io/kestra-io/pydata:latest
    taskRunner:
      type: io.kestra.plugin.ee.gcp.runner.Batch
      projectId: "{{ secret('GCP_PROJECT_ID') }}"
      region: "{{ vars.region }}"
      bucket: "{{ secret('GCS_BUCKET') }}"
      serviceAccount: "{{ secret('GOOGLE_SA') }}"
    commands:
      - python {{ workingDir }}/main.py
    namespaceFiles:
      enabled: true
    outputFiles:
      - "environment_info.json"
    inputFiles:
      main.py: |
        import platform
        import socket
        import sys
        import json
        from kestra import Kestra

        print("Hello from GCP Batch and kestra!")

        def print_environment_info():
            print(f"Host's network name: {platform.node()}")
            print(f"Python version: {platform.python_version()}")
            print(f"Platform information (instance type): {platform.platform()}")
            print(f"OS/Arch: {sys.platform}/{platform.machine()}")

            env_info = {
                "host": platform.node(),
                "platform": platform.platform(),
                "OS": sys.platform,
                "python_version": platform.python_version(),
            }
            Kestra.outputs(env_info)

            filename = '{{ workingDir }}/environment_info.json'
            with open(filename, 'w') as json_file:
                json.dump(env_info, json_file, indent=4)

        if __name__ == '__main__':
            print_environment_info()
```

Full step-by-step guide: setting up Google Batch from scratch

Before you begin

Before you start, you need to have the following:

  1. A Google Cloud account.
  2. A Kestra instance running version 0.18.0 or later, with Google credentials stored as secrets or environment variables within the Kestra instance.

Google Cloud Console Setup

Create a project

If you don't already have a project, create one with a name of your choice.


Once you've done this, make sure your project is selected in the menu bar.


Enable Batch API

Inside the Cloud Console, go to APIs & Services and search for the Batch API. Enable it so that Kestra can create Batch jobs.


After you've enabled it, you'll be prompted to create credentials so you can integrate it with your application.

Create the Service Account

Now that the Batch API is enabled, we can create credentials so that Kestra can access GCP directly. Following the prompt after enabling the Batch API, we're asked to select the type of data we will be using. In this case, it's Application data, which will create a Service Account for us to use.


After you've selected this, you'll need to give a name to your service account. Name it something memorable, as we'll need to type this into Kestra later.


Once you've given it a name, make sure to select the following roles:

  • Batch Job Editor
  • Logs Viewer
  • Storage Object Admin


Afterwards, we need to create a key for our service account so we can add it to Kestra. To do so, select the service account, open the Keys tab, and click Add Key.

Make sure to select JSON as the Key type so that we can either add it as a secret or paste it directly into our flow.


Check out this guide on how to add your service account into Kestra as a secret.
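
If you run Kestra with Docker Compose, one common approach (a sketch assuming the standard SECRET_ environment variable convention, where the value is base64-encoded) looks like this:

```yaml
# docker-compose.yml (snippet, hypothetical service layout)
services:
  kestra:
    environment:
      # base64-encoded service account JSON, referenced in flows
      # as {{ secret('GOOGLE_SA') }}
      SECRET_GOOGLE_SA: <base64-encoded service account key>
```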

We'll also need to make sure our service account can access the Compute Engine default service account so it can create jobs.

To do this, go to IAM & Admin, then Service Accounts. Select the Compute Engine default service account, open Permissions, and click Grant Access. Here, grant our original service account the Service Account User role, then click Save.


Create Bucket

Head to the search bar and type "Bucket" to find Cloud Storage buckets, then create a new bucket. You'll be prompted to set a name, region, and various other settings; for now, we can leave these at their defaults.


Creating our Flow

Below is an example flow that will run a Python file called main.py on a GCP Batch Task Runner. At the top of the io.kestra.plugin.scripts.python.Commands task, there are the properties for defining our Task Runner:

```yaml
containerImage: ghcr.io/kestra-io/kestrapy:latest
taskRunner:
  type: io.kestra.plugin.ee.gcp.runner.Batch
  projectId: "{{ secret('GCP_PROJECT_ID') }}"
  region: "{{ vars.region }}"
  bucket: "{{ secret('GCS_BUCKET') }}"
  serviceAccount: "{{ secret('GOOGLE_SA') }}"
```

This is where we can enter the details for GCP, such as the projectId, region, bucket, and serviceAccount. We can store all of these as secrets.

```yaml
id: gcp_batch_runner
namespace: company.team

variables:
  region: europe-west2

tasks:
  - id: scrape_environment_info
    type: io.kestra.plugin.scripts.python.Commands
    containerImage: ghcr.io/kestra-io/kestrapy:latest
    taskRunner:
      type: io.kestra.plugin.ee.gcp.runner.Batch
      projectId: "{{ secret('GCP_PROJECT_ID') }}"
      region: "{{ vars.region }}"
      bucket: "{{ secret('GCS_BUCKET') }}"
      serviceAccount: "{{ secret('GOOGLE_SA') }}"
    commands:
      - python {{ workingDir }}/main.py
    namespaceFiles:
      enabled: true
    outputFiles:
      - "environment_info.json"
    inputFiles:
      main.py: |
        import platform
        import socket
        import sys
        import json
        from kestra import Kestra

        print("Hello from GCP Batch and kestra!")

        def print_environment_info():
            print(f"Host's network name: {platform.node()}")
            print(f"Python version: {platform.python_version()}")
            print(f"Platform information (instance type): {platform.platform()}")
            print(f"OS/Arch: {sys.platform}/{platform.machine()}")

            env_info = {
                "host": platform.node(),
                "platform": platform.platform(),
                "OS": sys.platform,
                "python_version": platform.python_version(),
            }
            Kestra.outputs(env_info)

            filename = '{{ workingDir }}/environment_info.json'
            with open(filename, 'w') as json_file:
                json.dump(env_info, json_file, indent=4)

        print_environment_info()
```

When we press Execute, we can see in the Logs that our task runner is created.


We can also go to the GCP Console and see that our Batch job has been created:


Once the task has completed, Kestra automatically shuts down the runner on Google Cloud.

We can also view the outputs generated in the Outputs tab in Kestra, which contains information about the Google Batch task runner generated from our Python script:

