1 vote
1 answer
76 views

Creating Dataproc Cluster with public-ip-address using DataprocCreateClusterOperator in Airflow

I am trying to create a Dataproc Cluster in my GCP project within an Airflow DAG using the DataprocCreateClusterOperator. I am using the ClusterGenerator to generate the config for the cluster. ...
Mads
  • 105
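
For the ClusterGenerator question above, a minimal sketch of requesting public IP addresses, assuming the Airflow Google provider is installed; the project, region, cluster name and machine types below are placeholders. ClusterGenerator exposes an internal_ip_only flag, and passing False asks Dataproc to give the cluster VMs external IPs.

    from airflow.providers.google.cloud.operators.dataproc import (
        ClusterGenerator,
        DataprocCreateClusterOperator,
    )

    # Build the cluster config dict with ClusterGenerator (placeholder names).
    cluster_config = ClusterGenerator(
        project_id="my-project",
        master_machine_type="n1-standard-2",
        worker_machine_type="n1-standard-2",
        num_workers=2,
        internal_ip_only=False,  # False -> VMs get external (public) IP addresses
    ).make()

    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="my-project",
        region="europe-west1",
        cluster_name="my-cluster",
        cluster_config=cluster_config,
    )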
0 votes
0 answers
34 views

Dataproc cluster launch failing

I cannot create a Dataproc cluster on Compute Engine in a given VPC/subnet, even though the firewall rule exists in said VPC. Firewall rule Initialisation failed. Enabling Diagnostic mode com.google....
L. Arriola
1 vote
0 answers
55 views

SecretManagerClient.create() raise NoSuchMethodError io.grpc.MethodDescriptor$Marshaller with Scala on Dataproc Batch

I'm trying to read from Google Secret Manager. I deploy my Scala/Spark app as a fat JAR on Dataproc with SBT. When I call: val client: SecretManagerServiceClient = SecretManagerServiceClient.create() I ...
Arno
  • 101
-1 votes
1 answer
54 views

Running a normal non-Spark Java application in a Spark cluster

I want to run a normal Java application which connects to a Teradata database. I would like to run this Java app in a Spark cluster although my Java app is non-Spark. Questions are as follows: Is ...
ironfreak
0 votes
2 answers
73 views

Is there a way to pass the dataproc version via the Airflow DataprocCreateBatchOperator method?

I've just run into an issue where the default version of Dataproc was upgraded and broke my job, which is submitted using the DataprocCreateBatchOperator method through Airflow. task2 = ...
Velo
  • 65
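
For the question above about pinning the Dataproc version, a minimal sketch, assuming the Airflow Google provider: the Batch resource accepts a runtime_config.version field, so the Serverless runtime can be pinned in the batch body. Project, bucket and version values are placeholders.

    from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

    create_batch = DataprocCreateBatchOperator(
        task_id="create_batch",
        project_id="my-project",
        region="us-central1",
        batch_id="my-batch",
        batch={
            "pyspark_batch": {"main_python_file_uri": "gs://my-bucket/job.py"},
            # Pin the Dataproc Serverless runtime instead of taking the default.
            "runtime_config": {"version": "2.1"},
        },
    )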
0 votes
1 answer
90 views

ModuleNotFoundError: No module named 'minio' when submitting a PySpark job on Google Cloud Dataproc

I’m facing an issue when trying to submit a PySpark job to Google Cloud Dataproc. The goal is to run a script on the Dataproc cluster that uses the minio module. However, I keep encountering the ...
mikee thanh
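
For the minio question above, one commonly used pattern is to preinstall pip packages on every node at cluster creation time via the published pip-install initialization action. A sketch using the Python client; project, region and bucket names are placeholders, and the action path and metadata key should be checked against the initialization-actions repository.

    from google.cloud import dataproc_v1

    cluster = {
        "project_id": "my-project",
        "cluster_name": "my-cluster",
        "config": {
            # PIP_PACKAGES metadata is read by the pip-install initialization action.
            "gce_cluster_config": {"metadata": {"PIP_PACKAGES": "minio"}},
            "initialization_actions": [
                {
                    "executable_file": "gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh"
                }
            ],
        },
    }

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )
    client.create_cluster(project_id="my-project", region="us-central1", cluster=cluster).result()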
0 votes
0 answers
26 views

Dataproc : unable to find environment properties by archive path

I submit a dataproc job as below: gcloud dataproc jobs submit pyspark \ gs://persopnal/test/file1.py \ --py-files gs://persopnal/test/file1.py,gs://persopnal/test/file2.py \ --jars $jar \ --archieves ...
CodingSnowBall
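
For the archive question above: note that the gcloud flag is spelled --archives, and the corresponding field in the Jobs API is archive_uris. A minimal sketch of the same submission through the Python client, with placeholder bucket paths:

    from google.cloud import dataproc_v1

    job = {
        "placement": {"cluster_name": "my-cluster"},
        "pyspark_job": {
            "main_python_file_uri": "gs://my-bucket/file1.py",
            "python_file_uris": ["gs://my-bucket/file2.py"],
            # Archives are unpacked into the working directory of the driver/executors.
            "archive_uris": ["gs://my-bucket/env.zip"],
        },
    }

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )
    client.submit_job(project_id="my-project", region="us-central1", job=job)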
0 votes
0 answers
17 views

Unable to pass an SSL file path as part of the URL in Dataproc spark.read

I am using JDBC to connect to DB2 from Dataproc/PySpark. Since it uses a secure port, it requires some SSL information as part of the URL. Here's the format: jdbc:db2://IP_Addr:PORT/DB:sslConnection=...
Ayush Jha
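
For the DB2 SSL question above, a sketch of how SSL settings typically ride along in the JDBC URL, assuming the IBM JCC driver jar is on the cluster and the certificate file exists at the given path on every executor (for example shipped with --files); host, port, database, table and credentials are placeholders, and the exact SSL property names should be confirmed against the driver documentation.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("db2-read").getOrCreate()

    # DB2 JDBC URLs append properties after the database name: a colon before the
    # first property, then each property terminated with a semicolon.
    url = (
        "jdbc:db2://IP_Addr:PORT/DB"
        ":sslConnection=true"
        ";sslCertLocation=/path/on/workers/db2-cert.arm;"
    )

    df = (
        spark.read.format("jdbc")
        .option("url", url)
        .option("dbtable", "MY_SCHEMA.MY_TABLE")
        .option("user", "db2user")
        .option("password", "secret")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .load()
    )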
0 votes
1 answer
47 views

Creation of Component Gateway Viewer (CGV) in GCP Dataproc cluster behaves differently

I have a weird context with respect to component gateway viewer (CGV) creation in GCP Dataproc cluster. I create the Dataproc cluster along with CGV via Terraform as described below: #Cluster is ...
user3103957
1 vote
0 answers
149 views

Issue with properly setting up a Spark Session (Dataproc) to my Apache Iceberg BigLake tables

I've successfully set up an Iceberg table with a BigLake Metastore as per this documentation: https://cloud.google.com/bigquery/docs/iceberg-tables Everything works, I can see the Iceberg table in ...
RyuHayabusa
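
For the Iceberg/BigLake question above, a sketch of the Spark session settings described in the linked document, reproduced from memory, so the property names should be verified against it; it assumes the Iceberg runtime and BigLake catalog jars are already available on the cluster, and the project, location, catalog and bucket names are placeholders.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("iceberg-biglake")
        .config("spark.sql.catalog.blms", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.blms.catalog-impl", "org.apache.iceberg.gcp.biglake.BigLakeCatalog")
        .config("spark.sql.catalog.blms.gcp_project", "my-project")
        .config("spark.sql.catalog.blms.gcp_location", "us-central1")
        .config("spark.sql.catalog.blms.blms_catalog", "my_catalog")
        .config("spark.sql.catalog.blms.warehouse", "gs://my-bucket/warehouse")
        .getOrCreate()
    )

    # Tables are addressed as catalog.database.table through the configured catalog.
    spark.sql("SELECT * FROM blms.my_database.my_iceberg_table").show()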
1 vote
0 answers
85 views

Bigtable read and write using Dataproc on Compute Engine results in Key not found

I am experimenting with reading and writing data in Cloud Bigtable using Dataproc on Compute Engine and a PySpark job using the spark-bigtable-connector. I got an example from the spark-bigtable repo and ...
Suga Raj
  • 551
0 votes
0 answers
75 views

GCP Dataproc cluster creation warning

I am new to cloud and GCP. I'm encountering this warning message while creating a Dataproc cluster using the GCP CLI. I have the firewall rule set for Dataproc to function, yet I'm getting this ...
Dhananjay R Hunasgi
0 votes
1 answer
122 views

Error while scanning intermediate done dir - dataproc spark job

Our Spark aggregation jobs are taking a lot of execution time to complete. They are supposed to complete in 5 minutes but take 30 to 40 minutes. Dataproc cluster logging says it's trying to scan ...
vikrant rana
  • 4,654
1 vote
0 answers
63 views

dataproc ERROR ClusterManager: Could not initialize cluster nodes

I have created a Dataproc cluster with 1 master node and 5 workers. When I run a PySpark job I get this error: ERROR ClusterManager: Could not initialize cluster nodes currentNodeIndex=null I ...
Elias
  • 11
0 votes
1 answer
183 views

Set Spark configuration when running python in dbt for BigQuery

Making some progress on a proof of concept for a python dbt model in GCP (BigQuery). Built a dataproc cluster for Spark and able to execute the model, but I'm getting an error in the model that ...
Bo Bucklen
0 votes
0 answers
37 views

Optimization of my spark dataproc cluster job

Below is my Dataproc job log for processing around 50 GB of data. Logs: Time spent in Driver vs Executors: Driver WallClock Time 129m 50s (80.96%), Executor WallClock Time 30m 32s (19.04%) ...
Karan Kumar Gupta
0 votes
0 answers
25 views

Error "failed to list the clusters" creating cloud serverless runtime template

I'm trying to create a Dataproc Serverless runtime template to get a Spark environment notebook in a Vertex AI Workbench instance. If I click on the "New Runtime Template" in category ...
Jochen
0 votes
1 answer
168 views

Migration from Spark 3.1 to 3.3 org.apache.spark.shuffle.FetchFailedException failures

The pipeline runs fine on Spark 3.1 (Dataproc 2); after migrating to Spark 3.3 (Dataproc 2.1), the job fails with the error below. FetchFailed(BlockManagerId(246, ---internal, 7337, None), ...
Lakesh Kumar
0 votes
0 answers
54 views

PySpark on Dataproc out of memory

I am trying to load 1.5 million rows into a BigQuery table using PySpark: spark = SparkSession.builder.appName('process-prices') \ .config("spark.memory.offHeap.enabled", "...
Marcelo Gazzola
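
For the out-of-memory question above, a minimal sketch of writing to BigQuery through an intermediate GCS bucket while keeping per-task output small; it assumes the spark-bigquery connector jar is on the cluster, and the source path, partition count, dataset and bucket names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("process-prices").getOrCreate()
    df = spark.read.parquet("gs://my-bucket/prices/")  # placeholder source

    (
        df.repartition(32)  # more, smaller partitions -> smaller per-task memory footprint
        .write.format("bigquery")
        .option("table", "my_dataset.prices")
        .option("temporaryGcsBucket", "my-temp-bucket")
        .mode("append")
        .save()
    )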
0 votes
0 answers
19 views

Imports failing with workaround in Google Dataproc Cluster Notebooks

I am running several experiments in notebooks running on a Spark Dataproc cluster. Many of the functions stay the same from experiment to experiment (such as data preprocessing). But, after some ...
Kristin Day
1 vote
0 answers
118 views

How to run a Spark job on Dataproc with custom conda env file

I'm trying to run a Spark job on Dataproc with a custom conda environment. Here's my environment yaml file: name: parallel-jobs-on-dataproc channels: - default dependencies: - python=3.11 - ...
gregorp
  • 275
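
For the custom conda environment question above, one documented route for Dataproc on Compute Engine is pointing cluster creation at an environment YAML in GCS via a dataproc: cluster property (the exact key should be verified on the Dataproc Conda docs page). A sketch with the Python client; project, region and bucket names are placeholders.

    from google.cloud import dataproc_v1

    cluster = {
        "project_id": "my-project",
        "cluster_name": "conda-cluster",
        "config": {
            "software_config": {
                "properties": {
                    # Cluster property that points at the conda environment YAML in GCS.
                    "dataproc:conda.env.config.uri": "gs://my-bucket/envs/parallel-jobs-on-dataproc.yaml"
                }
            }
        },
    }

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )
    client.create_cluster(project_id="my-project", region="us-central1", cluster=cluster).result()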
0 votes
2 answers
95 views

CDF custom Plugin on DataProc - Provider for class javax.xml.parsers.DocumentBuilderFactory cannot be created

To set the context, I have created a custom CDAP transform plugin and tested it in CDAP local successfully. However when I deploy it to GCP CDF instance, everything works fine till preview mode but ...
Rajiv R
1 vote
1 answer
426 views

Python Dependencies for DataprocCreateBatchOperator

I cannot submit a Python job to Dataproc Serverless when third-party Python dependencies are needed. It works fine when dependencies are not needed. I push the PySpark Python file up to a cloud ...
baskInEminence
1 vote
0 answers
514 views

GCP DataProc Serverless - VPC/subnet/firewall requirements

I'm trying my first GCP Dataproc Serverless PySpark batch job, which needs to connect to a public REST endpoint, and write to a GCS bucket in the same GCP project. After having submitted, the job was ...
Averell
  • 843
0 votes
0 answers
56 views

Unable to find Dataproc Yarn aggregated and spark driver logs in GCP Cloud Logging

I am following this document while creating a new Dataproc cluster. I have set Dataproc properties for Spark driver logs in this manner - spark:spark.submit.deployMode = cluster dataproc:dataproc....
Shivam Soliya
0 votes
0 answers
40 views

FileNotFoundException for temporary file when running Spark on Dataproc/YARN

I need to download a binary file in a proprietary format, convert it, then move the converted file back to storage. I create the directory/file in /tmp using Java's Files.createTempDirectory and make the ...
Allan Silva
0 votes
0 answers
29 views

Submitting requests to a job running in a Dataproc cluster in GCP

I have a Dataproc cluster running in GCP. The cluster has a Spark job which is always running and listening for requests. Upon receiving a request, it needs to process it and return the result in ...
user3103957
0 votes
0 answers
278 views

Long-running Dataproc Spark job on Cloud Run on GCP

I have a small doubt to clarify and would appreciate expert advice: can I run a Dataproc (PySpark) job as an image on Cloud Run (service or job)? A Dataproc job may take a few minutes to hours to complete, so it will ...
Yogesh Sahu
0 votes
1 answer
194 views

How to change log level in dataproc serverless spark

In dataproc serverless, the underlying vm is hidden. How to change log level in dataproc serverless spark? --properties ^@^spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j2.properties ...
Warren Zhu
  • 1,495
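
For the log-level question above, a minimal alternative that works without touching the hidden VM: set the level from inside the job itself. This mainly affects driver-side logging; the values are the standard log4j levels.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quiet-job").getOrCreate()
    # Accepts TRACE, DEBUG, INFO, WARN, ERROR, FATAL, OFF.
    spark.sparkContext.setLogLevel("WARN")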
2 votes
1 answer
130 views

ValueError: unknown enum label "Hudi"

I am using dataproc 2.1 with the following software_config in a json file. "software_config": { "properties": {}, "optional_components": ["JUPYTER&...
user19930511
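
A likely cause for the enum error above, offered as a guess: the optional-component enum labels are the uppercase names (JUPYTER, HUDI, ...), so a mixed-case "Hudi" is rejected. A minimal sketch of the software_config written as a Python dict:

    # Optional component labels are the uppercase enum names.
    software_config = {
        "properties": {},
        "optional_components": ["JUPYTER", "HUDI"],
    }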
0 votes
0 answers
54 views

Create an email alert for a PySpark job executing on Google Dataproc

I am currently working on a PySpark job running on Google Dataproc and need to implement an email alert. The goal is to receive notifications when specific events or conditions occur during the ...
Mercy
0 votes
1 answer
224 views

Error installing package from private repository on Dataproc cluster

In order to initialize my Dataproc cluster, I'm trying to set up keyring authentication to a private pip repository. I followed the steps in Setting up authentication to Python package repositories - ...
Lorenzo Ventura
0 votes
1 answer
304 views

Invalid Argument When Creating Dataproc Cluster on GKE

I'm trying to create a dataproc cluster on GKE, through the gcp console using the following command gcloud dataproc clusters gke create dp-gke-cluster --region=europe-west1 --gke-cluster=gke-cluster --...
rqiu
  • 46
-1 votes
1 answer
427 views

Not able to submit a Composer (Airflow) job to Dataproc

I'm trying to submit an Airflow job in Composer (GCP) to Dataproc. When I do this I get the following error: Task failed with exception Traceback (most recent call last): File "/opt/python3.8/lib/...
luckyman80
0 votes
1 answer
28 views

Ordering of records in BigQuery tables while running DataProc jobs

While executing a Dataproc job on BigQuery tables, the order of the output is changed. Even when you import a given file as a BigQuery ...
Manish Kukreja
2 votes
1 answer
361 views

java.lang.NoClassDefFoundError: org/apache/beam/sdk/coders/CoderProviderRegistrar

I'm getting this error while trying to submit a shaded jar to dataproc as a spark job : java.lang.NoClassDefFoundError: org/apache/beam/sdk/coders/CoderProviderRegistrar I'm sure that this class ...
Ahmed Hesham
1 vote
1 answer
299 views

Not able to write into BigQuery JSON Field with Pyspark

I am trying to write a JSON object into a BigQuery table field. I am using a GCP Dataproc batch job. Could you please let me know if there is a way to write into a BQ table JSON field with Spark? Dataproc ...
Bhargav Velisetti
0 votes
1 answer
345 views

Cluster validation error in GCP DataProc : Requested number of primary workers must be within autoscaler min/max range

We are trying to spin up a DataProc cluster in GCP. We get the below errors when we do so: The first error is related to primary worker group and the second is to secondary worker group of nodes. In ...
user3103957
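
For the autoscaler range error above, the requested primary and secondary worker counts have to fall inside the ranges declared in the attached autoscaling policy. A sketch of a policy body with explicit ranges; all numbers and the policy id are placeholders.

    # Passed to AutoscalingPolicyServiceClient.create_autoscaling_policy (sketch only).
    autoscaling_policy = {
        "id": "my-policy",
        "worker_config": {"min_instances": 2, "max_instances": 10},
        "secondary_worker_config": {"min_instances": 0, "max_instances": 20},
        "basic_algorithm": {
            "yarn_config": {
                "graceful_decommission_timeout": {"seconds": 3600},
                "scale_up_factor": 0.5,
                "scale_down_factor": 0.5,
            }
        },
    }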
0 votes
1 answer
409 views

PyFlink job encountering "No module named 'google'" error when using FlinkKafkaConsumer

I'm working on a PyFlink job that reads data from a Kafka topic using the FlinkKafkaConsumer connector. However, I'm encountering a persistent issue related to the google module when trying to run the ...
sam1064max
0 votes
1 answer
672 views

Dataproc Serverless for Spark Batch runs into a timeout after 4 hours

How can I increase the time limit? Logs show "Found 1 invalidated leases" and "Task srvls-batch-d702cb8b-1d45-44d2-bf2e-6bf6275f66bf lease grant was revoked, cancelling work." ...
Uttkarsh Berwal
1 vote
0 answers
159 views

PySpark Job on Dataproc Throws IOException but Still Completes Successfully

I'm running a PySpark job on Google Cloud Dataproc, which uses structured streaming with a trigger of 'once'. The job reads Parquet data from a raw layer (a GCS bucket), applies certain business rules,...
Puredepatata
1 vote
1 answer
302 views

Pub/Sub Publish message from Dataproc cluster using Python: ACCESS_TOKEN_SCOPE_INSUFFICIENT

I have a problem publishing a Pub/Sub message from a Dataproc cluster. From a Cloud Function it works well with a service account, but with Dataproc I get this error: raise exceptions.from_grpc_error(exc)...
sqoor
  • 123
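
For the scope error above: ACCESS_TOKEN_SCOPE_INSUFFICIENT usually points at the OAuth scopes granted to the cluster VMs rather than at IAM roles. A sketch of a cluster config that grants the broad cloud-platform scope at creation time (names are placeholders); the dict is passed to the cluster create call as in the sketches further up.

    cluster = {
        "project_id": "my-project",
        "cluster_name": "pubsub-cluster",
        "config": {
            "gce_cluster_config": {
                # Broad scope so the VM service account token can call Pub/Sub.
                "service_account_scopes": ["https://www.googleapis.com/auth/cloud-platform"]
            }
        },
    }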
1 vote
1 answer
222 views

Configure trino-jvm properties in GCP Dataproc on cluster create

I'm trying to configure trino-jvm properties while creating a Dataproc cluster. I'm following Google's documentation and am able to successfully create a cluster without any special JVM configuration,...
Chris Noreikis
0 votes
1 answer
61 views

Does Google provide technical support for Dataproc's optional components, e.g. Ranger?

Does Google provide technical support for Dataproc's optional components, e.g. Ranger? If yes, can someone leave a link to verify?
ciao_bella
0 votes
2 answers
222 views

In Dataproc, should the file prefix be used when applying a property to a job?

Actually the document explicitly states: When applying a property to a job, the file prefix is not used. However, the example given there is inconsistent with this. This is what the page says: ......
figs_and_nuts
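
For the file-prefix question above, a sketch of the distinction as it is usually read: job-level properties use plain Spark keys, while cluster-level properties carry the file prefix that names the target config file. Names and values below are placeholders.

    # Job submission: no file prefix on the property keys.
    job = {
        "placement": {"cluster_name": "my-cluster"},
        "spark_job": {
            "main_class": "com.example.Main",  # hypothetical class name
            "jar_file_uris": ["gs://my-bucket/app.jar"],
            "properties": {"spark.executor.memory": "4g"},
        },
    }

    # Cluster creation: the "spark:" prefix routes the key to spark-defaults.conf.
    cluster_software_config = {
        "properties": {"spark:spark.executor.memory": "4g"}
    }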
1 vote
1 answer
748 views

Pyspark with custom container on GCP Dataproc Serverless : access to class in custom container image

I'm trying to start a PySpark job on GCP Dataproc Serverless with a custom container, but when I try to access my main class in my custom image, I get this exception: Exception in thread "...
Sophie192
1 vote
1 answer
123 views

How to enable outside connection before submit Pyspark job to Dataproc

I have a Pyspark file which will be submitted to Dataproc. try: print("Start writing") url = "jdbc:postgresql://some-ip:5432/postgres" properties = { "...
Wasin Phaijith
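
For the outside-connection question above, a sketch of a JDBC write that bundles the PostgreSQL driver via spark.jars.packages; it still assumes the cluster has a network path to the database host (external IPs or Cloud NAT if the host is outside the VPC). Host, credentials, table and driver version are placeholders.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("pg-write")
        .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
        .getOrCreate()
    )

    df = spark.createDataFrame([(1, "a")], ["id", "val"])
    df.write.jdbc(
        url="jdbc:postgresql://some-ip:5432/postgres",
        table="public.demo",
        mode="append",
        properties={"user": "postgres", "password": "secret", "driver": "org.postgresql.Driver"},
    )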
2 votes
0 answers
400 views

Accumulator is failing to update on pyspark job ran on dataproc cluster

The individual PySpark jobs do complete (I can see that in the logs), but during accumulation I get the following exception. The same piece of code works fine when executed locally. The cluster ...
Rohan Aswani
2 votes
1 answer
176 views

YARN CPU usage and the result of htop on workers are inconsistent. I am running a Spark cluster on Dataproc

I am on Dataproc managed spark cluster OS = Ubuntu 18.04 Spark version = 3.3.0 My cluster configuration is as follows: Master Memory = 7.5 GiB Cores = 2 Primary disk size = 32 GB Workers Cores = ...
figs_and_nuts
1 vote
1 answer
414 views

YARN allocates only 1 core per container when running Spark on YARN

Please ensure dynamic allocation is not killing your containers while you monitor the YARN UI. See the answer below. Issue: I can start the SparkSession with any number of cores per executor and the ...
figs_and_nuts
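
For the one-core-per-container question above, a commonly cited explanation is YARN's DefaultResourceCalculator, which accounts only for memory and therefore reports one vcore per container regardless of what Spark requested. Switching to the DominantResourceCalculator makes vcores part of the accounting; a sketch, expressed as a Dataproc cluster property to be set at cluster creation.

    # The capacity-scheduler: prefix routes the key to capacity-scheduler.xml.
    software_config = {
        "properties": {
            "capacity-scheduler:yarn.scheduler.capacity.resource-calculator":
                "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
        }
    }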