PySparkSubmit
type: "io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit"
Submit an Apache PySpark batch workload.
Examples
id: gcp_dataproc_py_spark_submit
namespace: company.team

tasks:
  - id: py_spark_submit
    type: io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit
    mainPythonFileUri: 'gs://spark-jobs-kestra/pi.py'
    name: test-pyspark
    region: europe-west3
Properties
mainPythonFileUri
- Type: string
- Dynamic: ✔️
- Required: ✔️
The HCFS URI of the main Python file to use as the Spark driver. Must be a .py file.
Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. The URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
name
- Type: string
- Dynamic: ✔️
- Required: ✔️
The batch name.
region
- Type: string
- Dynamic: ✔️
- Required: ✔️
The GCP region.
archiveUris
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. Each URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
args
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
The arguments to pass to the driver.
Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
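For example, positional arguments arrive in the driver script as sys.argv entries. A minimal sketch, assuming a hypothetical wordcount.py that reads an input URI and writes to an output URI:

- id: py_spark_submit_args
  type: io.kestra.plugin.gcp.dataproc.batches.PySparkSubmit
  mainPythonFileUri: 'gs://spark-jobs-kestra/wordcount.py'   # hypothetical script
  name: test-pyspark-args
  region: europe-west3
  args:
    - 'gs://spark-jobs-kestra/input/'    # available to the script as sys.argv[1]
    - 'gs://spark-jobs-kestra/output/'   # available to the script as sys.argv[2]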
execution
- Type: AbstractBatch-ExecutionConfiguration
- Dynamic: ✔️
- Required: ❌
Execution configuration for a workload.
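The available fields are listed under AbstractBatch-ExecutionConfiguration in the Definitions section below. A minimal sketch, with placeholder project, service account, and subnetwork values:

execution:
  serviceAccountEmail: dataproc-runner@my-project.iam.gserviceaccount.com      # placeholder
  subnetworkUri: projects/my-project/regions/europe-west3/subnetworks/default  # placeholder
  networkTags:
    - dataproc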
fileUris
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
HCFS URIs of files to be placed in the working directory of each executor.
Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. Each URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
jarFileUris
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.
Hadoop Compatible File System (HCFS) URIs should be accessible from the cluster. Each URI can be a GCS file with the gs:// prefix, an HDFS file on the cluster with the hdfs:// prefix, or a local file on the cluster with the file:// prefix.
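Both properties follow the same HCFS rules as mainPythonFileUri: fileUris entries land in each executor's working directory, while jarFileUris entries are added to the driver and task classpath. A minimal sketch, with hypothetical dependency paths:

fileUris:
  - 'gs://spark-jobs-kestra/config.json'            # hypothetical config file
jarFileUris:
  - 'gs://spark-jobs-kestra/libs/my-connector.jar'  # hypothetical jar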
peripherals
- Type: AbstractBatch-PeripheralsConfiguration
- Dynamic: ✔️
- Required: ❌
Peripherals configuration for a workload.
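The available fields are listed under AbstractBatch-PeripheralsConfiguration in the Definitions section below. A minimal sketch, with placeholder resource names following the formats documented there:

peripherals:
  metastoreService: projects/my-project/locations/europe-west3/services/my-metastore           # placeholder
  sparkHistoryServer:
    dataprocCluster: projects/my-project/regions/europe-west3/clusters/my-history-cluster      # placeholder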
projectId
- Type: string
- Dynamic: ✔️
- Required: ❌
The GCP project ID.
runtime
- Type: AbstractBatch-RuntimeConfiguration
- Dynamic: ✔️
- Required: ❌
Runtime configuration for a workload.
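The available fields are listed under AbstractBatch-RuntimeConfiguration in the Definitions section below. A minimal sketch, with a placeholder container image and an illustrative Spark property:

runtime:
  containerImage: 'europe-west3-docker.pkg.dev/my-project/spark/custom-runtime:1.0'  # placeholder image
  version: '2.2'                    # assumed runtime version; check what your region offers
  properties:
    spark.executor.instances: '4'   # standard Spark property, shown for illustration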
scopes
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
- Default: [https://www.googleapis.com/auth/cloud-platform]
The GCP scopes to be used.
serviceAccount
- Type: string
- Dynamic: ✔️
- Required: ❌
The GCP service account key.
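Because this property takes the key material itself rather than a path to a key file, it is typically rendered from a secret instead of being written inline. A minimal sketch, assuming a secret named GCP_SERVICE_ACCOUNT holds the JSON key:

serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT') }}"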
Outputs
state
- Type: string
- Required: ❌
- Possible Values: STATE_UNSPECIFIED, PENDING, RUNNING, CANCELLING, CANCELLED, SUCCEEDED, FAILED, UNRECOGNIZED
The state of the batch.
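The state can be referenced from subsequent tasks through Kestra's outputs expression. A minimal sketch, assuming the submit task from the example above (id py_spark_submit) and using the core Log task to print the result:

- id: log_state
  type: io.kestra.plugin.core.log.Log
  message: "PySpark batch finished in state {{ outputs.py_spark_submit.state }}"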
Definitions
io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-PeripheralsConfiguration
Properties
metastoreService
- Type: string
- Dynamic: ✔️
- Required: ❌
Resource name of an existing Dataproc Metastore service.
Example:
projects/[project_id]/locations/[region]/services/[service_id]
sparkHistoryServer
- Type: AbstractBatch-SparkHistoryServerConfiguration
- Dynamic: ✔️
- Required: ❌
Spark History Server configuration for the workload. See AbstractBatch-SparkHistoryServerConfiguration below.
io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-RuntimeConfiguration
Properties
containerImage
- Type: string
- Dynamic: ✔️
- Required: ❌
Optional custom container image for the job runtime environment.
If not specified, a default container image will be used.
properties
- Type: object
- SubType: string
- Dynamic: ✔️
- Required: ❌
Properties used to configure the workload execution (a map of key/value pairs).
version
- Type: string
- Dynamic: ✔️
- Required: ❌
Version of the batch runtime.
io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-SparkHistoryServerConfiguration
Properties
dataprocCluster
- Type: string
- Dynamic: ✔️
- Required: ❌
Resource name of an existing Dataproc Cluster to act as a Spark History Server for the workload.
Example:
projects/[project_id]/regions/[region]/clusters/[cluster_name]
io.kestra.plugin.gcp.dataproc.batches.AbstractBatch-ExecutionConfiguration
Properties
kmsKey
- Type: string
- Dynamic: ✔️
- Required: ❌
The Cloud KMS key to use for encryption.
networkTags
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
Tags used for network traffic control.
networkUri
- Type: string
- Dynamic: ✔️
- Required: ❌
Network URI to connect workload to.
serviceAccountEmail
- Type: string
- Dynamic: ✔️
- Required: ❌
Service account used to execute workload.
subnetworkUri
- Type: string
- Dynamic: ✔️
- Required: ❌
Subnetwork URI to connect workload to.