151 questions
1 vote · 1 answer · 76 views
Creating Dataproc Cluster with public-ip-address using DataprocCreateClusterOperator in Airflow
I am trying to create a Dataproc Cluster in my GCP project within an Airflow DAG using the DataprocCreateClusterOperator. I am using the ClusterGenerator to generate the config for the cluster. ...
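A minimal sketch of the usual fix, assuming a recent apache-airflow-providers-google release: ClusterGenerator exposes an internal_ip_only flag, and setting it to False requests public IP addresses for the cluster nodes. Project, region and machine types below are placeholders.

from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

# internal_ip_only=False requests public IPs for the cluster nodes
# (True would restrict the cluster to internal IPs only).
cluster_config = ClusterGenerator(
    project_id="my-project",              # placeholder
    master_machine_type="n1-standard-2",  # placeholder
    worker_machine_type="n1-standard-2",  # placeholder
    num_workers=2,
    internal_ip_only=False,
).make()

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",              # placeholder
    region="europe-west1",                # placeholder
    cluster_name="demo-cluster",          # placeholder
    cluster_config=cluster_config,
)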
0 votes · 0 answers · 34 views
Dataproc cluster launch failing
I cannot create a Dataproc cluster on Compute Engine in a given VPC/subnet, even though the firewall rule exists in said VPC.
Firewall rule
Initialisation failed. Enabling Diagnostic mode com.google....
1 vote · 0 answers · 55 views
SecretManagerClient.create() raise NoSuchMethodError io.grpc.MethodDescriptor$Marshaller with Scala on Dataproc Batch
I'm trying to read from Google Secret Manager. I deploy my Scala/Spark job as a fat JAR on Dataproc with SBT.
When I call:
val client: SecretManagerServiceClient = SecretManagerServiceClient.create()
I ...
-1 votes · 1 answer · 54 views
Running a normal Java non-Spark application on a Spark cluster
I want to run a normal Java application which connects to a Teradata database.
I would like to run this Java app on a Spark cluster even though my Java app is non-Spark.
My questions are as follows:
Is ...
0 votes · 2 answers · 73 views
Is there a way to pass the dataproc version via the Airflow DataprocCreateBatchOperator method?
I've just run into an issue where the default Dataproc version was upgraded and broke my job, which is submitted using the DataprocCreateBatchOperator method through Airflow.
task2 = ...
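For reference, a hedged sketch: the Batch resource carries a runtime_config.version field, so pinning the runtime in the batch dict passed to DataprocCreateBatchOperator should shield the job from default-version upgrades. All names and URIs are placeholders.

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

task2 = DataprocCreateBatchOperator(
    task_id="spark_batch",
    project_id="my-project",                              # placeholder
    region="us-central1",                                 # placeholder
    batch_id="my-batch-001",                              # placeholder
    batch={
        "pyspark_batch": {
            "main_python_file_uri": "gs://my-bucket/job.py",  # placeholder
        },
        # Pin the Dataproc Serverless runtime so default upgrades
        # do not silently change the Spark version under the job.
        "runtime_config": {"version": "2.1"},
    },
)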
0 votes · 1 answer · 90 views
ModuleNotFoundError: No module named 'minio' when submitting a PySpark job on Google Cloud Dataproc
I’m facing an issue when trying to submit a PySpark job to Google Cloud Dataproc. The goal is to run a script on the Dataproc cluster that uses the minio module. However, I keep encountering the ...
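One common approach, sketched under the assumption that the cluster is created from Airflow: Dataproc's dataproc:pip.packages cluster property pip-installs the listed packages on every node at creation time, which makes the module importable from executors. The package version is a placeholder.

from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

# dataproc:pip.packages installs the listed packages on all nodes
# when the cluster is created, so "import minio" works in the job.
cluster_config = ClusterGenerator(
    project_id="my-project",                                # placeholder
    num_workers=2,
    properties={"dataproc:pip.packages": "minio==7.2.7"},   # placeholder version
).make()

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",    # placeholder
    region="us-central1",       # placeholder
    cluster_name="minio-demo",  # placeholder
    cluster_config=cluster_config,
)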
0 votes · 0 answers · 26 views
Dataproc: unable to find environment properties by archive path
I submit a dataproc job as below:
gcloud dataproc jobs submit pyspark \
gs://persopnal/test/file1.py \
--py-files gs://persopnal/test/file1.py,gs://persopnal/test/file2.py \
--jars $jar \
--archives ...
0 votes · 0 answers · 17 views
Unable to pass an SSL file path in the URL in Dataproc spark.read
I am using JDBC to connect to DB2 from Dataproc/PySpark.
Since it uses a secure port, it requires some SSL information as part of the URL.
Here's the format:
jdbc:db2://IP_Addr:PORT/DB:sslConnection=...
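A hedged sketch of the usual pattern: the SSL settings can be passed as JDBC options instead of being packed into the URL, and sslCertLocation must point at a local path on the executors (e.g. a file shipped with --files), not a gs:// URL. Host, credentials and paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-read").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://10.0.0.1:50001/MYDB")   # placeholders
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "MYSCHEMA.MYTABLE")             # placeholder
    .option("user", "db2user")                         # placeholder
    .option("password", "...")                         # placeholder
    .option("sslConnection", "true")
    # Must be a path that exists on every executor, e.g. shipped
    # with --files; a gs:// URL here is typically not understood
    # by the JDBC driver.
    .option("sslCertLocation", "/tmp/db2-cert.arm")    # placeholder
    .load()
)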
0 votes · 1 answer · 47 views
Creation of Component Gateway Viewer (CGV) in GCP Dataproc cluster behaves differently
I have run into odd behavior with component gateway viewer (CGV) creation in a GCP Dataproc cluster. I create the Dataproc cluster along with CGV via Terraform as described below:
#Cluster is ...
1 vote · 0 answers · 149 views
Issue with properly setting up a Spark Session (Dataproc) to my Apache Iceberg BigLake tables
I've successfully set up an Iceberg table with a BigLake Metastore as per this documentation:
https://cloud.google.com/bigquery/docs/iceberg-tables
Everything works, I can see the Iceberg table in ...
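For comparison, a hedged sketch of the session settings that documentation page describes, with the catalog name, project, location and warehouse as placeholders:

from pyspark.sql import SparkSession

catalog = "blms"  # placeholder catalog name

spark = (
    SparkSession.builder.appName("iceberg-biglake")
    # Iceberg catalog backed by the BigLake Metastore, per the
    # iceberg-tables documentation linked above.
    .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{catalog}.catalog-impl",
            "org.apache.iceberg.gcp.biglake.BigLakeCatalog")
    .config(f"spark.sql.catalog.{catalog}.gcp_project", "my-project")  # placeholder
    .config(f"spark.sql.catalog.{catalog}.gcp_location", "us")         # placeholder
    .config(f"spark.sql.catalog.{catalog}.blms_catalog", catalog)
    .config(f"spark.sql.catalog.{catalog}.warehouse",
            "gs://my-bucket/warehouse")                                # placeholder
    .getOrCreate()
)

spark.sql(f"SELECT * FROM {catalog}.my_dataset.my_iceberg_table").show()  # placeholder names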
1 vote · 0 answers · 85 views
Bigtable read and write using Dataproc with Compute Engine results in "Key not found"
I am experimenting with reading and writing data in Cloud Bigtable from Dataproc on Compute Engine, using a PySpark job and the spark-bigtable-connector. I took an example from the spark-bigtable repo and ...
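A hedged sketch of the read path as shown in the spark-bigtable examples; a frequent cause of "key not found" is a catalog rowkey definition that does not match the layout of the keys that were written, so the same catalog JSON should be used on both sides. Project, instance and table names are placeholders.

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigtable-read").getOrCreate()

# The catalog maps DataFrame columns to the Bigtable row key and
# column families; the rowkey definition must match what was written.
catalog = json.dumps({
    "table": {"name": "my-table"},                    # placeholder
    "rowkey": "id",
    "columns": {
        "id":   {"cf": "rowkey", "col": "id",   "type": "string"},
        "name": {"cf": "info",   "col": "name", "type": "string"},
    },
})

df = (
    spark.read.format("bigtable")
    .option("catalog", catalog)
    .option("spark.bigtable.project.id", "my-project")    # placeholder
    .option("spark.bigtable.instance.id", "my-instance")  # placeholder
    .load()
)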
0 votes · 0 answers · 75 views
GCP Dataproc cluster creation warning
I am new to cloud and GCP. I am encountering this warning message while creating a Dataproc cluster using the GCP CLI. I have the firewall rule set for Dataproc to function, yet I am getting this ...
0 votes · 1 answer · 122 views
Error while scanning intermediate done dir - dataproc spark job
Our Spark aggregation jobs are taking a long time to complete. They are supposed to complete in 5 minutes but take 30 to 40 minutes.
The Dataproc cluster logging says it's trying to scan ...
1 vote · 0 answers · 63 views
dataproc ERROR ClusterManager: Could not initialize cluster nodes
I have created a cluster in Dataproc with 1 master node and 5 workers.
When I run a PySpark job I get this error:
ERROR ClusterManager: Could not initialize cluster nodes
currentNodeIndex=null
I ...
0 votes · 1 answer · 183 views
Set Spark configuration when running python in dbt for BigQuery
I'm making some progress on a proof of concept for a Python dbt model in GCP (BigQuery). I built a Dataproc cluster for Spark and am able to execute the model, but I'm getting an error in the model that ...
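A hedged sketch of where such settings usually live in dbt-bigquery Python models: dbt.config takes the Dataproc submission settings, while Spark-level configuration generally belongs on the cluster itself rather than inside the model. Cluster name and region are placeholders.

# my_python_model.py (dbt Python model)
def model(dbt, session):
    # Dataproc submission settings; names are placeholders.
    dbt.config(
        materialized="table",
        submission_method="cluster",
        dataproc_cluster_name="dbt-spark-cluster",
        dataproc_region="us-central1",
    )
    # "session" is the SparkSession dbt provides on Dataproc;
    # per-job Spark settings can be inspected here.
    df = dbt.ref("my_upstream_model")
    return df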
0 votes · 0 answers · 37 views
Optimization of my spark dataproc cluster job
Below is the log from my Dataproc job, which processes around 50 GB of data.
Logs:
Time spent in Driver vs Executors
Driver WallClock Time 129m 50s 80.96%
Executor WallClock Time 30m 32s 19.04%
...
0 votes · 0 answers · 25 views
Error "failed to list the clusters" creating cloud serverless runtime template
I try to create a dataproc cloud serverless runtime template to get a spark environment notebook in a vertex ai workbench instance.
If click on the "New Runtime Template" in category ...
0 votes · 1 answer · 168 views
Migration from Spark 3.1 to 3.3 org.apache.spark.shuffle.FetchFailedException failures
The pipeline runs fine on Spark 3.1 (Dataproc 2); after migrating to Spark 3.3 (Dataproc 2.1), the job fails with the error below.
FetchFailed(BlockManagerId(246, ---internal, 7337, None), ...
0 votes · 0 answers · 54 views
PySpark on Dataproc out of memory
I am trying to load 1.5 million rows into a BigQuery table using PySpark:
spark = SparkSession.builder.appName('process-prices') \
.config("spark.memory.offHeap.enabled", "...
0 votes · 0 answers · 19 views
Imports failing with workaround in Google Dataproc Cluster Notebooks
I am running several experiments in notebooks running on a Spark Dataproc cluster. Many of the functions stay the same from experiment to experiment (such as data preprocessing). But, after some ...
1 vote · 0 answers · 118 views
How to run a Spark job on Dataproc with custom conda env file
I'm trying to run a Spark job on Dataproc with a custom conda environment. Here's my environment yaml file:
name: parallel-jobs-on-dataproc
channels:
- default
dependencies:
- python=3.11
- ...
0 votes · 2 answers · 95 views
CDF custom Plugin on DataProc - Provider for class javax.xml.parsers.DocumentBuilderFactory cannot be created
To set the context: I created a custom CDAP transform plugin and tested it successfully in local CDAP. However, when I deploy it to a GCP CDF instance, everything works fine until preview mode, but ...
1 vote · 1 answer · 426 views
Python Dependencies for DataprocCreateBatchOperator
I cannot submit a Python job to Dataproc Serverless when third-party Python dependencies are needed. It works fine when dependencies are not needed. I push the PySpark Python file to a cloud ...
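One pattern that generally works, sketched with placeholder URIs: pure-Python dependencies can be zipped and passed via python_file_uris in the pyspark_batch config; packages with native extensions instead require a custom container image.

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

create_batch = DataprocCreateBatchOperator(
    task_id="pyspark_batch",
    project_id="my-project",      # placeholder
    region="us-central1",         # placeholder
    batch_id="deps-demo-001",     # placeholder
    batch={
        "pyspark_batch": {
            "main_python_file_uri": "gs://my-bucket/main.py",  # placeholder
            # deps.zip holds zipped pure-Python third-party
            # packages; it lands on the job's sys.path.
            "python_file_uris": ["gs://my-bucket/deps.zip"],   # placeholder
        },
    },
)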
1 vote · 0 answers · 514 views
GCP DataProc Serverless - VPC/subnet/firewall requirements
I'm trying my first GCP Dataproc Serverless PySpark batch job, which needs to connect to a public REST endpoint, and write to a GCS bucket in the same GCP project. After having submitted, the job was ...
0 votes · 0 answers · 56 views
Unable to find Dataproc Yarn aggregated and spark driver logs in GCP Cloud Logging
I am following this document while creating a new Dataproc cluster. I have set the Dataproc properties for Spark driver logs in this manner:
spark:spark.submit.deployMode = cluster
dataproc:dataproc....
0 votes · 0 answers · 40 views
FileNotFoundException for temporary file when runs Spark on Dataproc/Yarn
I need to download a binary file in a proprietary format, convert it, then move the converted file back to storage.
I create the directory/file under /tmp using Java's Files.createTempDirectory and make the ...
0 votes · 0 answers · 29 views
Submitting requests to a job running in a Dataproc cluster in GCP
I have a Dataproc cluster running in GCP. The cluster has a Spark job which is always running and listening for requests.
Upon receiving a request, it needs to process it and return the result in ...
0 votes · 0 answers · 278 views
Dataproc Spark job (long running) on Cloud Run on GCP
I have a small question and would appreciate expert advice.
Can I run a Dataproc (PySpark) job as an image on Cloud Run (service or job)? A Dataproc job may take a few minutes to hours to complete, so it will ...
0 votes · 1 answer · 194 views
How to change log level in dataproc serverless spark
In Dataproc Serverless, the underlying VM is hidden. How can I change the log level in Dataproc Serverless Spark?
--properties ^@^spark.executor.extraJavaOptions=-Dlog4j.configuration=spark-log4j2.properties ...
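If only the application's own verbosity matters, a simpler option that avoids shipping log4j files entirely is the SparkContext API; a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet-job").getOrCreate()

# Overrides the effective root log level for this application
# without any log4j2 properties file on the (hidden) VM.
spark.sparkContext.setLogLevel("WARN")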
2 votes · 1 answer · 130 views
ValueError: unknown enum label "Hudi"
I am using Dataproc 2.1 with the following software_config in a JSON file.
"software_config": {
"properties": {},
"optional_components": ["JUPYTER&...
0 votes · 0 answers · 54 views
Create an email alert for a PySpark job executing on Google Dataproc
I am currently working on a PySpark job running on Google Dataproc and need to implement an email alert. The goal is to receive notifications when specific events or conditions occur during the ...
0 votes · 1 answer · 224 views
Error installing package from private repository on Dataproc cluster
In order to initialize my Dataproc cluster I'm trying to set up keyring authentication to a private pip repository. I followed the steps in Setting up authentication to Python package repositories - ...
0 votes · 1 answer · 304 views
Invalid Argument When Creating Dataproc Cluster on GKE
I'm trying to create a Dataproc cluster on GKE through the GCP console, using the following command:
gcloud dataproc clusters gke create dp-gke-cluster --region=europe-west1 --gke-cluster=gke-cluster --...
-1 votes · 1 answer · 427 views
Not able to submit a Composer (Airflow) job to Dataproc
I'm trying to submit an Airflow job in Composer (GCP) to Dataproc. When I do this I get the following error:
Task failed with exception Traceback (most recent call last): File "/opt/python3.8/lib/...
0 votes · 1 answer · 28 views
Ordering of records in BigQuery tables while running DataProc jobs
While executing a Dataproc job on BigQuery tables, the order of the output changes. Even when you import a given file as a BigQuery ...
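This is expected: BigQuery tables have no inherent row order and Spark reads them in parallel, so any required order must be imposed explicitly. A minimal sketch with placeholder names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ordered-read").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("table", "my_dataset.my_table")  # placeholder
    .load()
)

# Order is only guaranteed where you ask for it explicitly.
ordered = df.orderBy("created_at")           # placeholder column
ordered.show()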
2 votes · 1 answer · 361 views
java.lang.NoClassDefFoundError: org/apache/beam/sdk/coders/CoderProviderRegistrar
I'm getting this error while trying to submit a shaded JAR to Dataproc as a Spark job:
java.lang.NoClassDefFoundError: org/apache/beam/sdk/coders/CoderProviderRegistrar
I'm sure that this class ...
1 vote · 1 answer · 299 views
Not able to write into BigQuery JSON Field with Pyspark
I am trying to write a JSON object into a BigQuery table field. I am using a GCP Dataproc batch job.
Could you please let me know if there is a way to write into a BQ table JSON field with Spark?
Dataproc ...
0 votes · 1 answer · 345 views
Cluster validation error in GCP DataProc : Requested number of primary workers must be within autoscaler min/max range
We are trying to spin up a DataProc cluster in GCP. We get the below errors when we do so:
The first error is related to the primary worker group and the second to the secondary worker group of nodes.
In ...
0 votes · 1 answer · 409 views
PyFlink job encountering "No module named 'google'" error when using FlinkKafkaConsumer
I'm working on a PyFlink job that reads data from a Kafka topic using the FlinkKafkaConsumer connector. However, I'm encountering a persistent issue related to the google module when trying to run the ...
0 votes · 1 answer · 672 views
Dataproc Serverless for Spark Batch runs into a timeout after 4 hours
How can I increase the time limit?
Logs show "Found 1 invalidated leases" and "Task srvls-batch-d702cb8b-1d45-44d2-bf2e-6bf6275f66bf lease grant was revoked, cancelling work."
...
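If I read the Batch API correctly, the batch lifetime is governed by environment_config.execution_config.ttl, so raising it in the batch payload is the usual suggestion; treat the field name as an assumption to verify against current docs. The file URI and duration are placeholders.

from google.cloud import dataproc_v1

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/job.py",  # placeholder
    ),
    environment_config=dataproc_v1.EnvironmentConfig(
        execution_config=dataproc_v1.ExecutionConfig(
            # Assumed field: maximum batch lifetime before the lease
            # is revoked; extend it past the ~4h default.
            ttl={"seconds": 8 * 3600},
        ),
    ),
)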
1 vote · 0 answers · 159 views
PySpark Job on Dataproc Throws IOException but Still Completes Successfully
I'm running a PySpark job on Google Cloud Dataproc, which uses structured streaming with a trigger of 'once'. The job reads Parquet data from a raw layer (a GCS bucket), applies certain business rules,...
1 vote · 1 answer · 302 views
Pub/Sub Publish message from Dataproc cluster using Python: ACCESS_TOKEN_SCOPE_INSUFFICIENT
I have a problem publishing a Pub/Sub message from a Dataproc cluster. From a Cloud Function it works well with a service account, but with Dataproc I get this error:
raise exceptions.from_grpc_error(exc)...
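ACCESS_TOKEN_SCOPE_INSUFFICIENT usually means the cluster VMs were created with narrow OAuth scopes, independently of the service account's IAM roles; recreating the cluster with the cloud-platform scope is the common fix. A hedged sketch via the Python client, with placeholder names:

from google.cloud import dataproc_v1

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}  # placeholder region
)

cluster = {
    "cluster_name": "pubsub-cluster",  # placeholder
    "config": {
        "gce_cluster_config": {
            "service_account": "my-sa@my-project.iam.gserviceaccount.com",  # placeholder
            # Broad scope so IAM alone decides access; without it the
            # VM's token cannot carry Pub/Sub permissions.
            "service_account_scopes": [
                "https://www.googleapis.com/auth/cloud-platform",
            ],
        },
    },
}

operation = client.create_cluster(
    project_id="my-project", region="us-central1", cluster=cluster  # placeholders
)
operation.result()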
1 vote · 1 answer · 222 views
Configure trino-jvm properties in GCP Dataproc on cluster create
I'm trying to configure trino-jvm properties while creating a Dataproc cluster. I'm following Google's documentation and am able to successfully create a cluster without any special JVM configuration,...
0 votes · 1 answer · 61 views
Does Google provide technical support for Dataproc's optional components, e.g. Ranger?
If yes, can someone leave a link to verify?
0 votes · 2 answers · 222 views
In Dataproc, should the file prefix be used when applying a property to a job?
The documentation explicitly states:
When applying a property to a job, the file prefix is not used.
However, the example given there is inconsistent with this. This is what the page says:
...
1 vote · 1 answer · 748 views
PySpark with custom container on GCP Dataproc Serverless: access to class in custom container image
I'm trying to start a PySpark job on GCP Dataproc Serverless with a custom container, but when I try to access my main class in my custom image, I get this exception:
Exception in thread "...
1 vote · 1 answer · 123 views
How to enable outside connections before submitting a PySpark job to Dataproc
I have a PySpark file which will be submitted to Dataproc.
try:
print("Start writing")
url = "jdbc:postgresql://some-ip:5432/postgres"
properties = {
"...
2 votes · 0 answers · 400 views
Accumulator is failing to update in a PySpark job run on a Dataproc cluster
The individual PySpark jobs do complete, as I can see in the logs, but during accumulation I get the following exception. The same piece of code works fine when executed locally.
The cluster ...
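As a sanity check, a minimal sketch of accumulator semantics: updates are applied on the executors and only become visible on the driver after an action such as foreach runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("acc-demo").getOrCreate()
sc = spark.sparkContext

acc = sc.accumulator(0)

def count_row(x):
    acc.add(1)           # runs on the executors

rdd = sc.parallelize(range(100), 4)
rdd.foreach(count_row)   # an action: updates are shipped back here

print(acc.value)         # 100 on the driver, only after the action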
2 votes · 1 answer · 176 views
YARN CPU usage and the result of htop on workers are inconsistent. I am running a Spark cluster on Dataproc
I am on a Dataproc managed Spark cluster
OS = Ubuntu 18.04
Spark version = 3.3.0
My cluster configuration is as follows:
Master
Memory = 7.5 GiB
Cores = 2
Primary disk size = 32 GB
Workers
Cores = ...
1 vote · 1 answer · 414 views
YARN allocates only 1 core per container when running Spark on YARN
Please ensure dynamic allocation is not killing your containers while you monitor the YARN UI; see the answer below.
Issue: I can start the SparkSession with any number of cores per executor, and the ...
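The classic explanation, offered as a hedged sketch: with the capacity scheduler's DefaultResourceCalculator, YARN accounts only for memory, so the UI reports 1 vcore per container even when executors use more; switching to the DominantResourceCalculator at cluster creation makes the vcore accounting match. The property names are standard Hadoop/Dataproc ones but worth verifying for your image version.

# Cluster properties map (Dataproc file-prefix syntax) making YARN
# account vcores as well as memory in its scheduling and UI numbers.
cluster_properties = {
    "capacity-scheduler:yarn.scheduler.capacity.resource-calculator":
        "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",
    "spark:spark.executor.cores": "2",  # placeholder executor sizing
}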