151 questions
1 vote · 1 answer · 76 views
Creating Dataproc Cluster with public-ip-address using DataprocCreateClusterOperator in Airflow
I am trying to create a Dataproc Cluster in my GCP project within an Airflow DAG using the DataprocCreateClusterOperator. I am using the ClusterGenerator to generate the config for the cluster. ...
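A minimal sketch of the usual fix, assuming a recent apache-airflow-providers-google release: ClusterGenerator exposes an internal_ip_only flag, and setting it to False requests public IP addresses for the cluster nodes. Project, region and machine types below are placeholders.

from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

# internal_ip_only=False requests public IPs for the cluster nodes
# (True would restrict the cluster to internal IPs only).
cluster_config = ClusterGenerator(
    project_id="my-project",              # placeholder
    master_machine_type="n1-standard-2",  # placeholder
    worker_machine_type="n1-standard-2",  # placeholder
    num_workers=2,
    internal_ip_only=False,
).make()

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",              # placeholder
    region="europe-west1",                # placeholder
    cluster_name="demo-cluster",          # placeholder
    cluster_config=cluster_config,
)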
0 votes · 0 answers · 34 views
Dataproc cluster launch failing
I cannot create a Dataproc cluster on Compute Engine in a given VPC/subnet, even though the firewall rule exists in said VPC.
Firewall rule
Initialisation failed. Enabling Diagnostic mode com.google....
1 vote · 0 answers · 55 views
SecretManagerClient.create() raise NoSuchMethodError io.grpc.MethodDescriptor$Marshaller with Scala on Dataproc Batch
I'm trying to read from Google Secret Manager. I deploy my Scala/Spark job as a fat JAR on Dataproc with SBT.
When I call:
val client: SecretManagerServiceClient = SecretManagerServiceClient.create()
I ...
-1 votes · 1 answer · 54 views
Running a normal Java non-Spark application on a Spark cluster
I want to run a normal Java application which connects to a Teradata database.
I would like to run this Java app on a Spark cluster even though my Java app is non-Spark.
My questions are as follows:
Is ...
0 votes · 2 answers · 73 views
Is there a way to pass the dataproc version via the Airflow DataprocCreateBatchOperator method?
I've just run into an issue where the default Dataproc version was upgraded and broke my job, which is submitted using the DataprocCreateBatchOperator method through Airflow.
task2 = ...
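For reference, a hedged sketch: the Batch resource carries a runtime_config.version field, so pinning the runtime in the batch dict passed to DataprocCreateBatchOperator should shield the job from default-version upgrades. All names and URIs are placeholders.

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

task2 = DataprocCreateBatchOperator(
    task_id="spark_batch",
    project_id="my-project",                              # placeholder
    region="us-central1",                                 # placeholder
    batch_id="my-batch-001",                              # placeholder
    batch={
        "pyspark_batch": {
            "main_python_file_uri": "gs://my-bucket/job.py",  # placeholder
        },
        # Pin the Dataproc Serverless runtime so default upgrades
        # do not silently change the Spark version under the job.
        "runtime_config": {"version": "2.1"},
    },
)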
0 votes · 1 answer · 90 views
ModuleNotFoundError: No module named 'minio' when submitting a PySpark job on Google Cloud Dataproc
I’m facing an issue when trying to submit a PySpark job to Google Cloud Dataproc. The goal is to run a script on the Dataproc cluster that uses the minio module. However, I keep encountering the ...
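One common approach, sketched under the assumption that the cluster is created from Airflow: Dataproc's dataproc:pip.packages cluster property pip-installs the listed packages on every node at creation time, which makes the module importable from executors. The package version is a placeholder.

from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

# dataproc:pip.packages installs the listed packages on all nodes
# when the cluster is created, so "import minio" works in the job.
cluster_config = ClusterGenerator(
    project_id="my-project",                                # placeholder
    num_workers=2,
    properties={"dataproc:pip.packages": "minio==7.2.7"},   # placeholder version
).make()

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",    # placeholder
    region="us-central1",       # placeholder
    cluster_name="minio-demo",  # placeholder
    cluster_config=cluster_config,
)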
0 votes · 0 answers · 26 views
Dataproc: unable to find environment properties by archive path
I submit a dataproc job as below:
gcloud dataproc jobs submit pyspark \
gs://persopnal/test/file1.py \
--py-files gs://persopnal/test/file1.py,gs://persopnal/test/file2.py \
--jars $jar \
--archives ...
0 votes · 0 answers · 17 views
Unable to pass an SSL file path in the URL in Dataproc spark.read
I am using JDBC to connect to DB2 from Dataproc/PySpark.
Since it uses a secure port, it requires some SSL information as part of the URL.
Here's the format:
jdbc:db2://IP_Addr:PORT/DB:sslConnection=...
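A hedged sketch of the usual pattern: the SSL settings can be passed as JDBC options instead of being packed into the URL, and sslCertLocation must point at a local path on the executors (e.g. a file shipped with --files), not a gs:// URL. Host, credentials and paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-read").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://10.0.0.1:50001/MYDB")   # placeholders
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "MYSCHEMA.MYTABLE")             # placeholder
    .option("user", "db2user")                         # placeholder
    .option("password", "...")                         # placeholder
    .option("sslConnection", "true")
    # Must be a path that exists on every executor, e.g. shipped
    # with --files; a gs:// URL here is typically not understood
    # by the JDBC driver.
    .option("sslCertLocation", "/tmp/db2-cert.arm")    # placeholder
    .load()
)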
0 votes · 1 answer · 47 views
Creation of Component Gateway Viewer (CGV) in GCP Dataproc cluster behaves differently
I have run into odd behavior with component gateway viewer (CGV) creation in a GCP Dataproc cluster. I create the Dataproc cluster along with CGV via Terraform as described below:
#Cluster is ...
1 vote · 0 answers · 149 views
Issue with properly setting up a Spark Session (Dataproc) to my Apache Iceberg BigLake tables
I've successfully set up an Iceberg table with a BigLake Metastore as per this documentation:
https://cloud.google.com/bigquery/docs/iceberg-tables
Everything works, I can see the Iceberg table in ...
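For comparison, a hedged sketch of the session settings that documentation page describes, with the catalog name, project, location and warehouse as placeholders:

from pyspark.sql import SparkSession

catalog = "blms"  # placeholder catalog name

spark = (
    SparkSession.builder.appName("iceberg-biglake")
    # Iceberg catalog backed by the BigLake Metastore, per the
    # iceberg-tables documentation linked above.
    .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{catalog}.catalog-impl",
            "org.apache.iceberg.gcp.biglake.BigLakeCatalog")
    .config(f"spark.sql.catalog.{catalog}.gcp_project", "my-project")  # placeholder
    .config(f"spark.sql.catalog.{catalog}.gcp_location", "us")         # placeholder
    .config(f"spark.sql.catalog.{catalog}.blms_catalog", catalog)
    .config(f"spark.sql.catalog.{catalog}.warehouse",
            "gs://my-bucket/warehouse")                                # placeholder
    .getOrCreate()
)

spark.sql(f"SELECT * FROM {catalog}.my_dataset.my_iceberg_table").show()  # placeholder names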
1 vote · 0 answers · 85 views
Bigtable read and write using Dataproc with Compute Engine results in "Key not found"
I am experimenting with reading and writing data in Cloud Bigtable from Dataproc on Compute Engine, using a PySpark job and the spark-bigtable-connector. I took an example from the spark-bigtable repo and ...
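A hedged sketch of the read path as shown in the spark-bigtable examples; a frequent cause of "key not found" is a catalog rowkey definition that does not match the layout of the keys that were written, so the same catalog JSON should be used on both sides. Project, instance and table names are placeholders.

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigtable-read").getOrCreate()

# The catalog maps DataFrame columns to the Bigtable row key and
# column families; the rowkey definition must match what was written.
catalog = json.dumps({
    "table": {"name": "my-table"},                    # placeholder
    "rowkey": "id",
    "columns": {
        "id":   {"cf": "rowkey", "col": "id",   "type": "string"},
        "name": {"cf": "info",   "col": "name", "type": "string"},
    },
})

df = (
    spark.read.format("bigtable")
    .option("catalog", catalog)
    .option("spark.bigtable.project.id", "my-project")    # placeholder
    .option("spark.bigtable.instance.id", "my-instance")  # placeholder
    .load()
)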
0 votes · 0 answers · 75 views
GCP Dataproc cluster creation warning
I am new to cloud and GCP. I am encountering this warning message while creating a Dataproc cluster using the GCP CLI. I have the firewall rule set for Dataproc to function, yet I am getting this ...
0 votes · 1 answer · 122 views
Error while scanning intermediate done dir - dataproc spark job
Our Spark aggregation jobs are taking a long time to complete. They are supposed to complete in 5 minutes but take 30 to 40 minutes.
The Dataproc cluster logging says it's trying to scan ...
1 vote · 0 answers · 63 views
dataproc ERROR ClusterManager: Could not initialize cluster nodes
I have created a cluster in Dataproc with 1 master node and 5 workers.
When I run a PySpark job I get this error:
ERROR ClusterManager: Could not initialize cluster nodes
currentNodeIndex=null
I ...
0 votes · 1 answer · 183 views
Set Spark configuration when running python in dbt for BigQuery
I'm making some progress on a proof of concept for a Python dbt model in GCP (BigQuery). I built a Dataproc cluster for Spark and am able to execute the model, but I'm getting an error in the model that ...
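A hedged sketch of where such settings usually live in dbt-bigquery Python models: dbt.config takes the Dataproc submission settings, while Spark-level configuration generally belongs on the cluster itself rather than inside the model. Cluster name and region are placeholders.

# my_python_model.py (dbt Python model)
def model(dbt, session):
    # Dataproc submission settings; names are placeholders.
    dbt.config(
        materialized="table",
        submission_method="cluster",
        dataproc_cluster_name="dbt-spark-cluster",
        dataproc_region="us-central1",
    )
    # "session" is the SparkSession dbt provides on Dataproc;
    # per-job Spark settings can be inspected here.
    df = dbt.ref("my_upstream_model")
    return df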
0 votes · 0 answers · 37 views
Optimization of my spark dataproc cluster job
Below is the log from my Dataproc job, which processes around 50 GB of data.
Logs:
Time spent in Driver vs Executors
Driver WallClock Time 129m 50s 80.96%
Executor WallClock Time 30m 32s 19.04%
...
0 votes · 0 answers · 25 views
Error "failed to list the clusters" creating cloud serverless runtime template
I try to create a dataproc cloud serverless runtime template to get a spark environment notebook in a vertex ai workbench instance.
If click on the "New Runtime Template" in category ...
0 votes · 1 answer · 168 views
Migration from Spark 3.1 to 3.3 org.apache.spark.shuffle.FetchFailedException failures
The pipeline runs fine on Spark 3.1 (Dataproc 2); after migrating to Spark 3.3 (Dataproc 2.1), the job fails with the error below.
FetchFailed(BlockManagerId(246, ---internal, 7337, None), ...
0 votes · 0 answers · 54 views
PySpark on Dataproc out of memory
I am trying to load 1.5 million rows into a BigQuery table using PySpark:
spark = SparkSession.builder.appName('process-prices') \
.config("spark.memory.offHeap.enabled", "...
0 votes · 0 answers · 19 views
Imports failing with workaround in Google Dataproc Cluster Notebooks
I am running several experiments in notebooks running on a Spark Dataproc cluster. Many of the functions stay the same from experiment to experiment (such as data preprocessing). But, after some ...
1 vote · 0 answers · 118 views
How to run a Spark job on Dataproc with custom conda env file
I'm trying to run a Spark job on Dataproc with a custom conda environment. Here's my environment yaml file:
name: parallel-jobs-on-dataproc
channels:
- default
dependencies:
- python=3.11
- ...
0 votes · 2 answers · 95 views
CDF custom Plugin on DataProc - Provider for class javax.xml.parsers.DocumentBuilderFactory cannot be created
To set the context: I created a custom CDAP transform plugin and tested it successfully in local CDAP. However, when I deploy it to a GCP CDF instance, everything works fine until preview mode, but ...
1 vote · 1 answer · 426 views
Python Dependencies for DataprocCreateBatchOperator
I cannot submit a Python job to Dataproc Serverless when third-party Python dependencies are needed. It works fine when dependencies are not needed. I push the PySpark Python file to a cloud ...
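One pattern that generally works, sketched with placeholder URIs: pure-Python dependencies can be zipped and passed via python_file_uris in the pyspark_batch config; packages with native extensions instead require a custom container image.

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

create_batch = DataprocCreateBatchOperator(
    task_id="pyspark_batch",
    project_id="my-project",      # placeholder
    region="us-central1",         # placeholder
    batch_id="deps-demo-001",     # placeholder
    batch={
        "pyspark_batch": {
            "main_python_file_uri": "gs://my-bucket/main.py",  # placeholder
            # deps.zip holds zipped pure-Python third-party
            # packages; it lands on the job's sys.path.
            "python_file_uris": ["gs://my-bucket/deps.zip"],   # placeholder
        },
    },
)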
1 vote · 0 answers · 514 views
GCP DataProc Serverless - VPC/subnet/firewall requirements
I'm trying my first GCP Dataproc Serverless PySpark batch job, which needs to connect to a public REST endpoint, and write to a GCS bucket in the same GCP project. After having submitted, the job was ...
0 votes · 0 answers · 56 views
Unable to find Dataproc Yarn aggregated and spark driver logs in GCP Cloud Logging
I am following this document while creating a new Dataproc cluster. I have set the Dataproc properties for Spark driver logs in this manner:
spark:spark.submit.deployMode = cluster
dataproc:dataproc....
0 votes · 0 answers · 40 views
FileNotFoundException for temporary file when runs Spark on Dataproc/Yarn
I need to download a binary file in a proprietary format, convert it, then move the converted file back to storage.
I create the directory/file under /tmp using Java's Files.createTempDirectory and make the ...
0 votes · 0 answers · 29 views
Submitting requests to a job running in a Dataproc cluster in GCP
I have a Dataproc cluster running in GCP. The cluster has a Spark job which is always running and listening for requests.
Upon receiving a request, it needs to process it and return the result in ...
0 votes · 0 answers · 278 views
Dataproc Spark job (long running) on Cloud Run on GCP
I have a small question and would appreciate expert advice.
Can I run a Dataproc (PySpark) job as an image on Cloud Run (service or job)? A Dataproc job may take a few minutes to hours to complete, so it will ...
0 votes · 1 answer · 194 views
How to change log level in dataproc serverless spark
In Dataproc Serverless, the underlying VM is hidden. How can I change the log level in Dataproc Serverless Spark?
--properties ^@^spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j2.properties ...
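If only the application's own verbosity matters, a simpler option that avoids shipping log4j files entirely is the SparkContext API; a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet-job").getOrCreate()

# Overrides the effective root log level for this application
# without any log4j2 properties file on the (hidden) VM.
spark.sparkContext.setLogLevel("WARN")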
2 votes · 1 answer · 130 views
ValueError: unknown enum label "Hudi"
I am using Dataproc 2.1 with the following software_config in a JSON file.
"software_config": {
"properties": {},
"optional_components": ["JUPYTER&...
0 votes · 0 answers · 54 views
Create an email alert for a PySpark job executing on Google Dataproc
I am currently working on a PySpark job running on Google Dataproc and need to implement an email alert. The goal is to receive notifications when specific events or conditions occur during the ...
0 votes · 1 answer · 224 views
Error installing package from private repository on Dataproc cluster
In order to initialize my Dataproc cluster I'm trying to set up keyring authentication to a private pip repository. I followed the steps in Setting up authentication to Python package repositories - ...
0 votes · 1 answer · 304 views
Invalid Argument When Creating Dataproc Cluster on GKE
I'm trying to create a Dataproc cluster on GKE through the GCP console, using the following command:
gcloud dataproc clusters gke create dp-gke-cluster --region=europe-west1 --gke-cluster=gke-cluster --...
-1 votes · 1 answer · 427 views
Not able to submit a Composer (Airflow) job to Dataproc
I'm trying to submit an Airflow job in Composer (GCP) to Dataproc. When I do this I get the following error:
Task failed with exception Traceback (most recent call last): File "/opt/python3.8/lib/...
0 votes · 1 answer · 28 views
Ordering of records in BigQuery tables while running DataProc jobs
While executing a Dataproc job on BigQuery tables, the order of the output changes. Even when you import a given file as a BigQuery ...
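This is expected: BigQuery tables have no inherent row order and Spark reads them in parallel, so any required order must be imposed explicitly. A minimal sketch with placeholder names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ordered-read").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("table", "my_dataset.my_table")  # placeholder
    .load()
)

# Order is only guaranteed where you ask for it explicitly.
ordered = df.orderBy("created_at")           # placeholder column
ordered.show()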
2 votes · 1 answer · 361 views
java.lang.NoClassDefFoundError: org/apache/beam/sdk/coders/CoderProviderRegistrar
I'm getting this error while trying to submit a shaded JAR to Dataproc as a Spark job:
java.lang.NoClassDefFoundError: org/apache/beam/sdk/coders/CoderProviderRegistrar
I'm sure that this class ...
1 vote · 1 answer · 299 views
Not able to write into BigQuery JSON Field with Pyspark
I am trying to write a JSON object into a BigQuery table field. I am using a GCP Dataproc batch job.
Could you please let me know if there is a way to write into a BQ table JSON field with Spark?
Dataproc ...
0 votes · 1 answer · 345 views
Cluster validation error in GCP DataProc : Requested number of primary workers must be within autoscaler min/max range
We are trying to spin up a DataProc cluster in GCP. We get the below errors when we do so:
The first error is related to the primary worker group and the second to the secondary worker group of nodes.
In ...
0 votes · 1 answer · 409 views
PyFlink job encountering "No module named 'google'" error when using FlinkKafkaConsumer
I'm working on a PyFlink job that reads data from a Kafka topic using the FlinkKafkaConsumer connector. However, I'm encountering a persistent issue related to the google module when trying to run the ...
0 votes · 1 answer · 672 views
Dataproc Serverless for Spark Batch runs into a timeout after 4 hours
How can I increase the time limit?
Logs show "Found 1 invalidated leases" and "Task srvls-batch-d702cb8b-1d45-44d2-bf2e-6bf6275f66bf lease grant was revoked, cancelling work."
...
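If I read the Batch API correctly, the batch lifetime is governed by environment_config.execution_config.ttl, so raising it in the batch payload is the usual suggestion; treat the field name as an assumption to verify against current docs. The file URI and duration are placeholders.

from google.cloud import dataproc_v1

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/job.py",  # placeholder
    ),
    environment_config=dataproc_v1.EnvironmentConfig(
        execution_config=dataproc_v1.ExecutionConfig(
            # Assumed field: maximum batch lifetime before the lease
            # is revoked; extend it past the ~4h default.
            ttl={"seconds": 8 * 3600},
        ),
    ),
)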
1 vote · 0 answers · 159 views
PySpark Job on Dataproc Throws IOException but Still Completes Successfully
I'm running a PySpark job on Google Cloud Dataproc, which uses structured streaming with a trigger of 'once'. The job reads Parquet data from a raw layer (a GCS bucket), applies certain business rules,...
1 vote · 1 answer · 302 views
Pub/Sub Publish message from Dataproc cluster using Python: ACCESS_TOKEN_SCOPE_INSUFFICIENT
I have a problem publishing a Pub/Sub message from a Dataproc cluster. From a Cloud Function it works well with a service account, but with Dataproc I get this error:
raise exceptions.from_grpc_error(exc)...
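ACCESS_TOKEN_SCOPE_INSUFFICIENT usually means the cluster VMs were created with narrow OAuth scopes, independently of the service account's IAM roles; recreating the cluster with the cloud-platform scope is the common fix. A hedged sketch via the Python client, with placeholder names:

from google.cloud import dataproc_v1

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}  # placeholder region
)

cluster = {
    "cluster_name": "pubsub-cluster",  # placeholder
    "config": {
        "gce_cluster_config": {
            "service_account": "my-sa@my-project.iam.gserviceaccount.com",  # placeholder
            # Broad scope so IAM alone decides access; without it the
            # VM's token cannot carry Pub/Sub permissions.
            "service_account_scopes": [
                "https://www.googleapis.com/auth/cloud-platform",
            ],
        },
    },
}

operation = client.create_cluster(
    project_id="my-project", region="us-central1", cluster=cluster  # placeholders
)
operation.result()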
1 vote · 1 answer · 222 views
Configure trino-jvm properties in GCP Dataproc on cluster create
I'm trying to configure trino-jvm properties while creating a Dataproc cluster. I'm following Google's documentation and am able to successfully create a cluster without any special JVM configuration,...
0 votes · 1 answer · 61 views
Does Google provide technical support for Dataproc's optional components, e.g. Ranger?
If yes, can someone leave a link to verify?
0 votes · 2 answers · 222 views
In Dataproc, should the file prefix be used when applying a property to a job?
The documentation explicitly states:
When applying a property to a job, the file prefix is not used.
However, the example given there is inconsistent with this. This is what the page says:
...
1 vote · 1 answer · 748 views
PySpark with custom container on GCP Dataproc Serverless: access to class in custom container image
I'm trying to start a PySpark job on GCP Dataproc Serverless with a custom container, but when I try to access my main class in my custom image, I get this exception:
Exception in thread "...
1 vote · 1 answer · 123 views
How to enable outside connections before submitting a PySpark job to Dataproc
I have a PySpark file which will be submitted to Dataproc.
try:
print("Start writing")
url = "jdbc:postgresql://some-ip:5432/postgres"
properties = {
"...
2 votes · 0 answers · 400 views
Accumulator is failing to update in a PySpark job run on a Dataproc cluster
The individual PySpark jobs do complete, as I can see in the logs, but during accumulation I get the following exception. The same piece of code works fine when executed locally.
The cluster ...
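As a sanity check, a minimal sketch of accumulator semantics: updates are applied on the executors and only become visible on the driver after an action such as foreach runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("acc-demo").getOrCreate()
sc = spark.sparkContext

acc = sc.accumulator(0)

def count_row(x):
    acc.add(1)           # runs on the executors

rdd = sc.parallelize(range(100), 4)
rdd.foreach(count_row)   # an action: updates are shipped back here

print(acc.value)         # 100 on the driver, only after the action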
2 votes · 1 answer · 176 views
YARN CPU usage and the result of htop on workers are inconsistent. I am running a Spark cluster on Dataproc
I am on a Dataproc managed Spark cluster
OS = Ubuntu 18.04
Spark version = 3.3.0
My cluster configuration is as follows:
Master
Memory = 7.5 GiB
Cores = 2
Primary disk size = 32 GB
Workers
Cores = ...
1 vote · 1 answer · 414 views
YARN allocates only 1 core per container when running Spark on YARN
Please ensure dynamic allocation is not killing your containers while you monitor the YARN UI; see the answer below.
Issue: I can start the SparkSession with any number of cores per executor, and the ...
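The classic explanation, offered as a hedged sketch: with the capacity scheduler's DefaultResourceCalculator, YARN accounts only for memory, so the UI reports 1 vcore per container even when executors use more; switching to the DominantResourceCalculator at cluster creation makes the vcore accounting match. The property names are standard Hadoop/Dataproc ones but worth verifying for your image version.

# Cluster properties map (Dataproc file-prefix syntax) making YARN
# account vcores as well as memory in its scheduling and UI numbers.
cluster_properties = {
    "capacity-scheduler:yarn.scheduler.capacity.resource-calculator":
        "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",
    "spark:spark.executor.cores": "2",  # placeholder executor sizing
}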