3,405 questions
0 votes · 0 answers · 22 views
EMR: Pyspark on AWS Graviton does not work
The same pyspark code works on r7a but not r7g or r8g on an EMR cluster (7.5).
I build the python environment with conda, and use it in pyspark:
conda create -n pyspark python=3.9 --show-channel-urls --...
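A common cause of a pyspark env working on r7a but failing on r7g/r8g is an architecture mismatch: r7a instances are x86_64 while the Graviton families are aarch64, so a conda environment built on an x86 host will not start on Graviton nodes. A minimal stdlib sketch for making the mismatch visible (`env_matches_cluster` is a hypothetical helper; run the check on the build host and on a cluster node):

```python
import platform

def env_matches_cluster(cluster_arch: str) -> bool:
    """True if the architecture this Python was built for matches the cluster's."""
    return platform.machine() == cluster_arch

# 'x86_64' on r7a-style hosts, 'aarch64' on Graviton (r7g/r8g) nodes.
print(platform.machine())
```

If the strings differ, the environment has to be rebuilt on (or cross-built for) an aarch64 host before shipping it to the Graviton cluster.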
0 votes · 1 answer · 23 views
apache-beam installation issue on AWS EMR-EC2 cluster
I started an AWS EMR-EC2 cluster, and I am having trouble getting the SparkRunner of apache-beam to work.
I have a python script that will use apache-beam. I have tried either aws emr add-steps or ssh ...
0 votes · 0 answers · 45 views
Config params are not propagated when using Spark Connect
I am trying to get Spark Connect working on Amazon EMR (Spark v3.5.1). I started the Connect server on EMR primary node, making sure the JARs required for S3 auth are present in the Classpath:
/usr/...
-1 votes · 2 answers · 51 views
Jackson Databind Conflicts in Apache Spark Project Using Maven Shade Plugin
I am working on a project that processes IMDb data using Apache Spark. My setup involves Spark Core and Spark SQL dependencies, along with Jackson for handling JSON serialization and deserialization. ...
0 votes · 0 answers · 30 views
OSCAR EMR API: Issues with AppointmentHistory Endpoint (REST and SOAP)
I’m working on integrating with OSCAR EMR, an open-source electronic medical record (EMR) system. My goal is to fetch appointment data (ideally for a specific patient) to validate and manage bookings ...
0 votes · 1 answer · 34 views
Pyspark error: "Class org.apache.hadoop.fs.s3a.S3AFileSystem not found" in EMR 7.0.0
I am using EMR 7.0.0 in AWS, which has Python 3.9, Spark 3.5.0, and Hadoop 3.3.6.
I got the error:
File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/...
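A configuration sketch, not the confirmed fix for this question (the hadoop-aws version is an assumption matched to the Hadoop 3.3.6 the asker lists): on EMR the bundled EMRFS serves s3:// URIs natively, so rewriting s3a:// paths as s3:// often sidesteps the missing class entirely; if s3a:// is genuinely required, the jar that contains S3AFileSystem can be requested explicitly.

```python
# Config fragment only; assumes pyspark is installed and the hadoop-aws
# version matches the cluster's Hadoop (3.3.6 here, per the question).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pulls in org.apache.hadoop.fs.s3a.S3AFileSystem and its dependencies.
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.6")
    .getOrCreate()
)
# Alternatively, keep the default session and read via s3:// (EMRFS) instead
# of s3a://, which avoids needing the S3A classes at all on EMR.
```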
0 votes · 1 answer · 28 views
Jobs failed in Airflow despite jobs stuck in running in Spark History UI. AWS Serverless
Has anyone ever experienced jobs failing in Airflow while in Spark History UI the jobs are still stuck in running? Also, after I added a line of code to write the data to S3 (without reading it ...
0 votes · 0 answers · 74 views
Trino large query is failing due to Hive metastore failure
I have a large insert-into query, which reaches 99-100% quite fast (after 5-6 min); then the completion percent starts to decrease, climbs back to a high number, and repeats.
This query is reading from ...
0 votes · 1 answer · 41 views
Flink Job Execution Fails with `NoClassDefFoundError` on AWS EMR with Python
I am trying to run a Flink job on an AWS EMR cluster (v7.3.0) using Python 3.9 and Apache Flink with PyFlink. My job reads from an AWS Kinesis stream and prints the stream data to console. However, ...
0 votes · 0 answers · 27 views
MSK, EMR, MWAA automation with ansible in AWS
I need to automate the creation of an MWAA environment, an EMR cluster, and an MSK cluster in AWS as part of a custom environment build for the customer use case.
I have the ansible tool available for automation ...
-2 votes · 1 answer · 61 views
spark-submit using --py-files option could not find path to modules
I am trying to submit a pyspark job to an EMR cluster. The code for the job lies in a zipped package that is placed in S3:
/bin/spark-submit \
--py-files s3://my-dev/scripts/job-launchers/dev/pipeline.zip ...
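`--py-files` ships the zip to driver and executors and prepends it to `sys.path`, so imports resolve only if the packages sit at the *root* of the archive (e.g. `mypkg/__init__.py`, not `pipeline/mypkg/__init__.py`). A small self-contained sketch of that mechanism (`mypkg` is a hypothetical package name for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a zip laid out the way --py-files expects: package at the zip root.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "pipeline.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mypkg/__init__.py", "")
    zf.writestr("mypkg/jobs.py", "def run():\n    return 'ok'\n")

# This is effectively what spark-submit does with the --py-files archive.
sys.path.insert(0, zip_path)
from mypkg import jobs

print(jobs.run())  # -> ok
```

If the import fails here, it will fail on the cluster too; fixing the zip layout locally is usually faster than iterating through EMR submissions.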
1 vote · 0 answers · 49 views
Spark EMR executor container failing due to Java heap space
One of my Spark jobs is failing because an executor container dies with "java.lang.OutOfMemoryError: Java heap space". Any recommendation is appreciated.
I am using EMR with 200 r7g.16xlarge ...
1 vote · 0 answers · 44 views
Optimizing PySpark Feature Engineering with Over a Billion Rows on EMR
I’m working with a large transaction dataset (~1 billion rows) in PySpark on AWS EMR. My goal is to perform feature engineering where I compute statistics like sum, mean, standard deviation, and ...
0 votes · 0 answers · 45 views
CancellationException in Apache Hudi
I have an Apache Hudi job that reads data for one day from S3 and writes it to a Hudi table with clustering enabled on an event_name column. My job was running fine for one month, but then one day ...
0 votes · 0 answers · 58 views
Spark Job failed in EMR with exit code 137
The spark job runs in EMR with version 7.2. It's failing with the error below. Any advice or tips to debug?
Error
Job aborted due to stage failure: Task 1518 in stage 14.0 failed 4 times, most recent ...
1 vote · 0 answers · 57 views
Spark Shuffle FetchFailedException for large dataset in emr
My job takes as input a 400 TB parquet dataset from s3. This job runs with 250 r716x large instances (each having 64 vCore, 488 GiB memory). The job fails with the error below: org.apache.spark.shuffle....
1 vote · 1 answer · 64 views
Unable to add logs from my spark jobs to spark eventlogs
I am trying to output logs in Spark event logs so that they are accessible in the history server.
I have tried two approaches:
Adding my own custom logger that extends Serializable
extending org.apache....
0 votes · 1 answer · 233 views
Where to define "EMR runtime role for cluster" for SageMaker users?
I am encountering this error while attempting to connect to my EMR Serverless Cluster from within the SageMaker Studio Notebook:
Select EMR runtime execution role for cluster
No available EMR ...
0 votes · 1 answer · 26 views
Same Spark job executing in parallel for different days in different instances of AWS EMR has performance issues
While running the spark job (only one instance), it completes in 20-30 mins. However, the same code executing in multiple EMR instances in parallel takes more time. Ex: I have 3 instances where each ...
1 vote · 2 answers · 240 views
Amazon EMR 7.2 does not support Ganglia?
When I submit spark job using Amazon EMR 7.2, I got exception:
software.amazon.awssdk.services.emr.model.EmrException: Specified
application: Ganglia is invalid. (Service: Emr, Status Code: 400,
...
0 votes · 0 answers · 25 views
How to check a spark job on the Amazon EMR console when the job is submitted with Livy
I can successfully submit a spark job through Apache Livy to Amazon EMR.
But I did not see any steps on the Amazon EMR console.
Though my spark job completed successfully, I could only check the job state ...
0 votes · 0 answers · 20 views
Unexplained sporadic connection error when getting files from S3 with Spark
Context
Between roughly 2024-08-10 20:00 and 2024-08-12 04:00, my half-hourly Scala/Spark EMR job failed numerous times, having not failed once in the previous several years.
Error in the logs:
24/08/12 02:...
0 votes · 0 answers · 29 views
Alter table rename column on Hudi tables on AWS EMR and AWS S3
Using EMR 7.2
[root@ip-172-31-34-128 ~]# spark-sql --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/hudi/hudi-aws-bundle.jar --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer&...
0 votes · 1 answer · 45 views
Spark Shell with S3, Hudi and EMR. org.apache.hadoop.fs.s3a.S3AFileSystem not found
Following https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
spark-shell --jars /usr/lib/hudi/hudi-spark-bundle.jar \
--conf "spark.serializer=org.apache.spark....
0 votes · 1 answer · 30 views
Running Spark Shell with Hudi, S3, EMR. Getting not authorized to perform: glue:GetDatabase on resource
The glue database "default" has already been created.
scala> (inputDF.write
| .format("hudi")
| .options(hudiOptions)
| .option(DataSourceWriteOptions....
1 vote · 1 answer · 338 views
AQEShuffleRead in Spark creating few partitions though advisoryPartitionSizeInBytes and initialPartitionNum are provided
I have added spark.sql.adaptive.advisoryPartitionSizeInBytes=268435456 and spark.sql.adaptive.enabled=true. However, my data size for each partition is more than 256 MB. I see in the DAG that the ...
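One detail worth stating about these settings: AQE's partition coalescing only merges small shuffle partitions upward toward the advisory size; it does not split partitions that already exceed it, so the initial partition count has to be high enough up front. A configuration sketch with illustrative values (not the asker's confirmed settings):

```python
# Config fragment only; requires pyspark and a running cluster to be useful.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    # Target size AQE coalesces *toward*; it never splits larger partitions.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", 268435456)
    # Starting shuffle partition count must be large enough that coalescing
    # down to ~256 MB partitions is even possible.
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", 2000)
    .getOrCreate()
)
```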
0 votes · 0 answers · 148 views
ImportError when importing boto3 in EMR Serverless 7.2.0
I'm deploying an EMR Serverless application and using venv-pack to create the python environment for that application. The venv-pack zip is created inside a Docker container:
FROM --platform=linux/...
0 votes · 1 answer · 42 views
unable to run start_notebook_execution for an EMR Jupyter notebook
Why is my AWS EMR Jupyter notebook saying that my cluster id is null? I've verified that 1) the cluster id value being passed is neither null, None, nor '' (empty string) and 2) confirmed that the ...
0 votes · 0 answers · 20 views
Epic EMR Hyperdrive configuration
I am working as a back-end developer in a healthcare organization; we are integrating Epic EMR APIs, and we need to check patients' updated details in the Epic Hyperdrive application.
I have installed it but don't ...
2 votes · 3 answers · 157 views
Write out the peak memory utilization of a Pyspark Job on EMR to a file
We run a lot of Pyspark jobs on EMR. The pipeline executed is the same, but the inputs can wildly change the peak memory utilization, and that utilization is growing over time. I would like to ...
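One hedged approach (assuming Spark 3.x, where the executors endpoint of the Spark UI / history server REST API reports peakMemoryMetrics): after each job, poll `/api/v1/applications/<app-id>/executors` and record the maximum to a file. The helper below is a sketch; the server URL, field choice, and sample payload are assumptions:

```python
# Sketch: extract peak JVM heap across executors from the payload returned by
#   http://<history-server>:18080/api/v1/applications/<app-id>/executors
# (fetchable with urllib.request.urlopen + json.load in a real pipeline).
def peak_jvm_heap(executors: list) -> int:
    """Max JVMHeapMemory (bytes) across executors in the REST API payload."""
    return max(
        e.get("peakMemoryMetrics", {}).get("JVMHeapMemory", 0)
        for e in executors
    )

# Illustrative payload shape, trimmed to the fields used above.
sample = [
    {"id": "1", "peakMemoryMetrics": {"JVMHeapMemory": 123456789}},
    {"id": "2", "peakMemoryMetrics": {"JVMHeapMemory": 987654321}},
]
print(peak_jvm_heap(sample))  # -> 987654321
```

Writing that number (plus the app id and a timestamp) to S3 at the end of each run gives a cheap longitudinal record of the growth the question describes.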
1 vote · 0 answers · 32 views
Why are identical spark jobs taking longer to execute on emr cluster if they are submitted later?
I have an emr cluster that I am submitting 50 jobs to at the same time (about 3 minutes between the first submission and the last submission). I want all the jobs to run in parallel, and I should see ...
0 votes · 1 answer · 60 views
Unable to access public APIs using the requests library in EMR Serverless
Getting the error below when I am trying to fetch an API using the requests library.
Traceback (most recent call last):
File "/tmp/spark-39775710-130a-4403-9182-c557003f351b/lib.zip/urllib3/connection.py&...
0 votes · 0 answers · 51 views
spark on EMR error when using `foreachBatch`: "terminated with exception: Error while obtaining a new communication channel"
I use spark on EMR with versions: emr-6.13.0, Spark 3.4.1
I try to run a simple spark streaming job that reads from kafka and writes to a memory table using foreachBatch, and I get the failure "Error while ...
0 votes · 0 answers · 63 views
EMR Serverless SparkSession builder error: ClassNotFoundException issues
I am trying to create a job in EMR Studio to run in an EMR Serverless application. It's a relatively basic script to use PySpark to read some Athena tables, do some joins, create an output dataframe ...
0 votes · 0 answers · 36 views
Does spark shuffle/exchange convert compressed data to uncompressed form?
I have an input dataset which is 450 GB in S3 in compressed parquet format. However, during exchange it shows as 10 TB. Is there any way to tune it? Two large tables are getting joined, and no other ...
0 votes · 1 answer · 78 views
Spark-Scala vs Pyspark: DAG is different?
I am converting a pyspark job to Scala, and the jobs execute in EMR. The parameters, data, and code are the same. However, I see the run time is different, and the DAG getting created is also different. Here I ...
0 votes · 0 answers · 38 views
Apache oozie JA008 error - job state changed from SUCCEEDED to FAILED
I'm running Oozie HA 5.2.1 on EMR and I have an issue with this temporary directory. I have a workflow which has start node -> action node -> end node. The job starts running -> runs for 10-15 ...
0 votes · 0 answers · 36 views
AWS EMR - reading multiple "zip" files from S3 bucket returns "Your key is too long"
In my daily job I use EMR to process a large amount of data. This data is stored in CSV files in an S3 bucket. The idea I had was to try to process zipped CSV files instead of plain CSV.
In the Hive app I use ...
0 votes · 0 answers · 27 views
Airflow error while creating EMR cluster via DAG
I am looking to create an EMR cluster via airflow DAG using EmrCreateJobFlowOperator using a role called dev-emr-ec2-profile-role for jobFlow. This role is used to provision EMR cluster via Terraform ...
5 votes · 0 answers · 100 views
Spark Repartition/shuffle optimization
I am trying to repartition before applying any transformation logic. This takes a lot of time. Here are the code and a snapshot of the UI below. Can any optimization be applied here?
Cluster: AWS EMR,200 Task ...
1 vote · 0 answers · 108 views
Spark EMR Shuffle Read Fetch Wait Time is 4 hrs
One of my spark jobs failed due to emr-spark-shuffle-fetchfailedexception-with-65tb-data-with-aqe-enabled and has a high Shuffle Read Fetch Wait Time. Is there any way it can be improved?
Spark-submit
spark-...
0 votes · 0 answers · 89 views
Troubleshooting Kafka Integration with Spark Streaming on Amazon EMR Serverless
Objective:
To set up a streaming job on Amazon EMR Serverless to process weather data from Amazon MSK (Managed Streaming for Apache Kafka) and write the word count results to an S3 bucket.
Steps Taken:...
0 votes · 1 answer · 118 views
EMR-Spark Job creating max 1000 partitions/tasks when AQE is enabled
I always see 1000 tasks/partitions getting created for spark jobs with AQE enabled. If I execute the job for monthly data (4 weeks of weekly data) or for a week of data, the shuffle partitions are the same, which is nothing ...
0 votes · 2 answers · 135 views
What does retry in SparkUI mean?
I have spark executed in two different instances:
spark.sql.adaptive.coalescePartitions.enabled=false
spark.sql.adaptive.coalescePartitions.enabled=true
In the first instance, the stage graphs have ...
0 votes · 0 answers · 35 views
ClassCastException in Spark SQL Incremental Load with DBT
I'm encountering a ClassCastException error when running an incremental load using DBT and Spark SQL. The error message indicates an issue with casting in the Spark execution plan:
org.apache.hive....
1 vote · 1 answer · 83 views
Spark emr jobs: Is the number of tasks defined by AQE (adaptive.enabled)?
I see the number of tasks in the spark job is only 1000 after the initial read, whereas the number of cores available is 9000 (1800 executors * 5 cores each). I have enabled AQE and coalesce shuffle partitions. In ...
0 votes · 1 answer · 277 views
How to enable "Use for Hive table metadata" in "AWS Glue Data Catalog settings" using Terraform?
I am using Terraform to set up Trino cluster managed by Amazon EMR.
Here is my Terraform code:
resource "aws_emr_cluster" "hm_amazon_emr_cluster" {
name ...
0 votes · 0 answers · 106 views
Spark EMR long running transformation job GC is taking more time
I have a transformation job running in spark emr (7.1). This job is compute intensive, as it calculates approx_percentile and other aggregate functions. GC is taking more time. I would like to know ...
0 votes · 0 answers · 63 views
AWS EMR Jupyterhub notebook run fails with error: Session isn't active
I noticed this question is asked a lot, but the solutions proposed in other threads do not work for me.
I have created an EMR cluster on AWS and I am running a quite time intensive code. The notebook ...
2 votes · 1 answer · 133 views
Spark aggregate on multiple columns or a hash
Suppose I want to drop duplicates or perform an aggregation on 3 columns in my Spark dataframe.
Would it be more optimal to do
df = df.withColumn(
"hash_dup",
f.hash(
...
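For intuition on the trade-off in this question, a plain-Python sketch (toy data, not Spark): deduplicating on the columns themselves is exact, while deduplicating on a hash adds collision risk. Spark's f.hash is a 32-bit Murmur3 hash, so at billions of rows collisions are expected and distinct rows can be silently dropped; a wide hash avoids that but buys nothing over comparing the columns directly.

```python
import hashlib

rows = [("a", 1, "x"), ("a", 1, "x"), ("b", 2, "y")]

# Option 1: deduplicate on the tuple of the three columns directly (exact).
by_cols = list(dict.fromkeys(rows))

# Option 2: deduplicate on a hash of the three columns. sha256 is used here
# so the toy example is collision-free; a 32-bit hash like Spark's f.hash
# would not give that guarantee at scale.
def row_hash(r):
    return hashlib.sha256("|".join(map(str, r)).encode()).hexdigest()

seen, by_hash = set(), []
for r in rows:
    h = row_hash(r)
    if h not in seen:
        seen.add(h)
        by_hash.append(r)

print(by_cols)  # -> [('a', 1, 'x'), ('b', 2, 'y')]
print(by_cols == by_hash)  # -> True
```

In Spark terms, `df.dropDuplicates(["c1", "c2", "c3"])` already shuffles on exactly those columns, so adding a hash column rarely helps correctness or performance.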