3,405 questions
0 votes · 0 answers · 22 views
EMR: Pyspark on AWS Graviton does not work
The same pyspark code works on r7a but not r7g or r8g on an EMR cluster (7.5).
I build the python environment with conda, and use it in pyspark:
conda create -n pyspark python=3.9 --show-channel-urls --...
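A common cause of a pyspark env working on r7a but failing on r7g/r8g is an architecture mismatch: r7a instances are x86_64 while the Graviton families are aarch64, so a conda environment built on an x86 host will not start on Graviton nodes. A minimal stdlib sketch for making the mismatch visible (`env_matches_cluster` is a hypothetical helper; run the check on the build host and on a cluster node):

```python
import platform

def env_matches_cluster(cluster_arch: str) -> bool:
    """True if the architecture this Python was built for matches the cluster's."""
    return platform.machine() == cluster_arch

# 'x86_64' on r7a-style hosts, 'aarch64' on Graviton (r7g/r8g) nodes.
print(platform.machine())
```

If the strings differ, the environment has to be rebuilt on (or cross-built for) an aarch64 host before shipping it to the Graviton cluster.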
0 votes · 1 answer · 23 views
apache-beam installation issue on AWS EMR-EC2 cluster
I started an AWS EMR-EC2 cluster, and I am having trouble getting the SparkRunner of apache-beam to work.
I have a python script that will use apache-beam. I have tried either aws emr add-steps or ssh ...
0 votes · 0 answers · 45 views
Config params are not propagated when using Spark Connect
I am trying to get Spark Connect working on Amazon EMR (Spark v3.5.1). I started the Connect server on EMR primary node, making sure the JARs required for S3 auth are present in the Classpath:
/usr/...
-1 votes · 2 answers · 51 views
Jackson Databind Conflicts in Apache Spark Project Using Maven Shade Plugin
I am working on a project that processes IMDb data using Apache Spark. My setup involves Spark Core and Spark SQL dependencies, along with Jackson for handling JSON serialization and deserialization. ...
0 votes · 0 answers · 30 views
OSCAR EMR API: Issues with AppointmentHistory Endpoint (REST and SOAP)
I’m working on integrating with OSCAR EMR, an open-source electronic medical record (EMR) system. My goal is to fetch appointment data (ideally for a specific patient) to validate and manage bookings ...
0 votes · 1 answer · 34 views
Pyspark error: "Class org.apache.hadoop.fs.s3a.S3AFileSystem not found" in EMR 7.0.0
I am using EMR 7.0.0 in AWS, which has Python 3.9, Spark 3.5.0, and Hadoop 3.3.6.
I got the error:
File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/...
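A configuration sketch, not the confirmed fix for this question (the hadoop-aws version is an assumption matched to the Hadoop 3.3.6 the asker lists): on EMR the bundled EMRFS serves s3:// URIs natively, so rewriting s3a:// paths as s3:// often sidesteps the missing class entirely; if s3a:// is genuinely required, the jar that contains S3AFileSystem can be requested explicitly.

```python
# Config fragment only; assumes pyspark is installed and the hadoop-aws
# version matches the cluster's Hadoop (3.3.6 here, per the question).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pulls in org.apache.hadoop.fs.s3a.S3AFileSystem and its dependencies.
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.6")
    .getOrCreate()
)
# Alternatively, keep the default session and read via s3:// (EMRFS) instead
# of s3a://, which avoids needing the S3A classes at all on EMR.
```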
0 votes · 1 answer · 28 views
Jobs failed in Airflow despite jobs stuck in running in Spark History UI. AWS Serverless
Has anyone ever experienced jobs failing in Airflow while in Spark History UI the jobs are still stuck in running? Also, after I added a line of code to write the data to S3 (without reading it ...
0 votes · 0 answers · 74 views
Trino large query is failing due to Hive metastore failure
I have a large insert-into query, which reaches 99-100% quite fast (after 5-6 min); then the completion percent starts to decrease, climbs back to a high number, and repeats.
This query is reading from ...
0 votes · 1 answer · 41 views
Flink Job Execution Fails with `NoClassDefFoundError` on AWS EMR with Python
I am trying to run a Flink job on an AWS EMR cluster (v7.3.0) using Python 3.9 and Apache Flink with PyFlink. My job reads from an AWS Kinesis stream and prints the stream data to console. However, ...
0 votes · 0 answers · 27 views
MSK, EMR, MWAA automation with ansible in AWS
I need to automate the creation of an MWAA environment, an EMR cluster, and an MSK cluster in AWS as part of a custom environment build for the customer use case.
I have the ansible tool available for automation ...
-2 votes · 1 answer · 61 views
spark-submit using --py-files option could not find path to modules
I am trying to submit a pyspark job to an EMR cluster. The code for the job lies in a zipped package that is placed in S3:
/bin/spark-submit \
--py-files s3://my-dev/scripts/job-launchers/dev/pipeline.zip ...
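`--py-files` ships the zip to driver and executors and prepends it to `sys.path`, so imports resolve only if the packages sit at the *root* of the archive (e.g. `mypkg/__init__.py`, not `pipeline/mypkg/__init__.py`). A small self-contained sketch of that mechanism (`mypkg` is a hypothetical package name for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a zip laid out the way --py-files expects: package at the zip root.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "pipeline.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mypkg/__init__.py", "")
    zf.writestr("mypkg/jobs.py", "def run():\n    return 'ok'\n")

# This is effectively what spark-submit does with the --py-files archive.
sys.path.insert(0, zip_path)
from mypkg import jobs

print(jobs.run())  # -> ok
```

If the import fails here, it will fail on the cluster too; fixing the zip layout locally is usually faster than iterating through EMR submissions.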
1 vote · 0 answers · 49 views
Spark EMR executor container failing due to Java heap space
One of my Spark jobs is failing because an executor container dies with "java.lang.OutOfMemoryError: Java heap space". Any recommendation is appreciated.
I am using EMR with 200 r7g.16xlarge ...
1 vote · 0 answers · 44 views
Optimizing PySpark Feature Engineering with Over a Billion Rows on EMR
I’m working with a large transaction dataset (~1 billion rows) in PySpark on AWS EMR. My goal is to perform feature engineering where I compute statistics like sum, mean, standard deviation, and ...
0 votes · 0 answers · 45 views
CancellationException in Apache Hudi
I have an Apache Hudi job that reads data for one day from S3 and writes it to a Hudi table with clustering enabled on an event_name column. My job was running fine for one month, but then one day ...
0 votes · 0 answers · 58 views
Spark Job failed in EMR with exit code 137
The spark job runs in EMR with version 7.2. It's failing with the error below. Any advice or tips to debug?
Error
Job aborted due to stage failure: Task 1518 in stage 14.0 failed 4 times, most recent ...
1 vote · 0 answers · 57 views
Spark Shuffle FetchFailedException for large dataset in emr
My job takes as input a 400 TB parquet dataset from s3. This job runs with 250 r716x large instances (each having 64 vCore, 488 GiB memory). The job fails with the error below: org.apache.spark.shuffle....
1 vote · 1 answer · 64 views
Unable to add logs from my spark jobs to spark eventlogs
I am trying to output logs in Spark event logs so that they are accessible in the history server.
I have tried two approaches:
Adding my own custom logger that extends Serializable
extending org.apache....
0 votes · 1 answer · 233 views
Where to define "EMR runtime role for cluster" for SageMaker users?
I am encountering this error while attempting to connect to my EMR Serverless Cluster from within the SageMaker Studio Notebook:
Select EMR runtime execution role for cluster
No available EMR ...
0 votes · 1 answer · 26 views
Same Spark job executing in parallel for different days in different instances of AWS EMR has performance issues
While running the spark job (only one instance), it completes in 20-30 mins. However, the same code executing in multiple EMR instances in parallel takes more time. Ex: I have 3 instances where each ...
1 vote · 2 answers · 240 views
Amazon EMR 7.2 does not support Ganglia?
When I submit spark job using Amazon EMR 7.2, I got exception:
software.amazon.awssdk.services.emr.model.EmrException: Specified
application: Ganglia is invalid. (Service: Emr, Status Code: 400,
...
0 votes · 0 answers · 25 views
How to check a spark job on the Amazon EMR console when the job is submitted with Livy
I can successfully submit a spark job through Apache Livy to Amazon EMR.
But I did not see any steps on the Amazon EMR console.
Though my spark job completed successfully, I could only check the job state ...
0 votes · 0 answers · 20 views
Unexplained sporadic connection error when getting files from S3 with Spark
Context
Between roughly 2024-08-10 20:00 and 2024-08-12 04:00, my half-hourly Scala/Spark EMR job failed numerous times, having not failed once in the previous several years.
Error in the logs:
24/08/12 02:...
0 votes · 0 answers · 29 views
Alter table rename column on Hudi tables on AWS EMR and AWS S3
Using EMR 7.2
[root@ip-172-31-34-128 ~]# spark-sql --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/hudi/hudi-aws-bundle.jar --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer&...
0 votes · 1 answer · 45 views
Spark Shell with S3, Hudi and EMR. org.apache.hadoop.fs.s3a.S3AFileSystem not found
Following https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
spark-shell --jars /usr/lib/hudi/hudi-spark-bundle.jar \
--conf "spark.serializer=org.apache.spark....
0 votes · 1 answer · 30 views
Running Spark Shell with Hudi, S3, EMR. Getting not authorized to perform: glue:GetDatabase on resource
The glue database "default" has already been created.
scala> (inputDF.write
| .format("hudi")
| .options(hudiOptions)
| .option(DataSourceWriteOptions....
1 vote · 1 answer · 338 views
AQEShuffleRead in Spark creating few partitions though advisoryPartitionSizeInBytes and initialPartitionNum are provided
I have added spark.sql.adaptive.advisoryPartitionSizeInBytes=268435456 and spark.sql.adaptive.enabled=true. However, my data size for each partition is more than 256 MB. I see in the DAG that the ...
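One detail worth stating about these settings: AQE's partition coalescing only merges small shuffle partitions upward toward the advisory size; it does not split partitions that already exceed it, so the initial partition count has to be high enough up front. A configuration sketch with illustrative values (not the asker's confirmed settings):

```python
# Config fragment only; requires pyspark and a running cluster to be useful.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    # Target size AQE coalesces *toward*; it never splits larger partitions.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", 268435456)
    # Starting shuffle partition count must be large enough that coalescing
    # down to ~256 MB partitions is even possible.
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", 2000)
    .getOrCreate()
)
```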
0 votes · 0 answers · 148 views
ImportError when importing boto3 in EMR Serverless 7.2.0
I'm deploying an EMR Serverless application and using venv-pack to create the python environment for that application. The venv-pack zip is created inside a Docker container:
FROM --platform=linux/...
0 votes · 1 answer · 42 views
unable to run start_notebook_execution for an EMR Jupyter notebook
Why is my AWS EMR Jupyter notebook saying that my cluster id is null? I've verified that 1) the cluster id value being passed is neither null, None, nor '' (empty string) and 2) confirmed that the ...
0 votes · 0 answers · 20 views
Epic EMR Hyperdrive configuration
I am working as a back-end developer in a healthcare organization; we are integrating Epic EMR APIs, and we need to check patients' updated details in the Epic Hyperdrive application.
I have installed it but don't ...
2 votes · 3 answers · 157 views
Write out the peak memory utilization of a Pyspark Job on EMR to a file
We run a lot of Pyspark jobs on EMR. The pipeline executed is the same, but the inputs can wildly change the peak memory utilization, and that utilization is growing over time. I would like to ...
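One hedged approach (assuming Spark 3.x, where the executors endpoint of the Spark UI / history server REST API reports peakMemoryMetrics): after each job, poll `/api/v1/applications/<app-id>/executors` and record the maximum to a file. The helper below is a sketch; the server URL, field choice, and sample payload are assumptions:

```python
# Sketch: extract peak JVM heap across executors from the payload returned by
#   http://<history-server>:18080/api/v1/applications/<app-id>/executors
# (fetchable with urllib.request.urlopen + json.load in a real pipeline).
def peak_jvm_heap(executors: list) -> int:
    """Max JVMHeapMemory (bytes) across executors in the REST API payload."""
    return max(
        e.get("peakMemoryMetrics", {}).get("JVMHeapMemory", 0)
        for e in executors
    )

# Illustrative payload shape, trimmed to the fields used above.
sample = [
    {"id": "1", "peakMemoryMetrics": {"JVMHeapMemory": 123456789}},
    {"id": "2", "peakMemoryMetrics": {"JVMHeapMemory": 987654321}},
]
print(peak_jvm_heap(sample))  # -> 987654321
```

Writing that number (plus the app id and a timestamp) to S3 at the end of each run gives a cheap longitudinal record of the growth the question describes.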
1 vote · 0 answers · 32 views
Why are identical spark jobs taking longer to execute on emr cluster if they are submitted later?
I have an emr cluster that I am submitting 50 jobs to at the same time (about 3 minutes between the first submission and the last submission). I want all the jobs to run in parallel, and I should see ...
0 votes · 1 answer · 60 views
Unable to access public APIs using the requests library in EMR Serverless
Getting the error below when I am trying to fetch an API using the requests library.
Traceback (most recent call last):
File "/tmp/spark-39775710-130a-4403-9182-c557003f351b/lib.zip/urllib3/connection.py&...
0 votes · 0 answers · 51 views
spark on EMR error when using `foreachBatch`: "terminated with exception: Error while obtaining a new communication channel"
I use spark on EMR with versions: emr-6.13.0, Spark 3.4.1
I try to run a simple spark streaming job that reads from kafka and writes to a memory table using foreachBatch, and I get the failure "Error while ...
0 votes · 0 answers · 63 views
EMR Serverless SparkSession builder error: ClassNotFoundException issues
I am trying to create a job in EMR Studio to run in an EMR Serverless application. It's a relatively basic script to use PySpark to read some Athena tables, do some joins, create an output dataframe ...
0 votes · 0 answers · 36 views
Does spark shuffle/exchange convert compressed data to uncompressed form?
I have an input dataset which is 450 GB in S3 in compressed parquet format. However, during exchange it shows as 10 TB. Is there any way to tune it? Two large tables are getting joined, and no other ...
0 votes · 1 answer · 78 views
Spark-Scala vs Pyspark: DAG is different?
I am converting a pyspark job to Scala, and the jobs execute in EMR. The parameters, data, and code are the same. However, I see the run time is different, and the DAG getting created is also different. Here I ...
0 votes · 0 answers · 38 views
Apache oozie JA008 error - job state changed from SUCCEEDED to FAILED
I'm running Oozie HA 5.2.1 on EMR and I have an issue with this temporary directory. I have a workflow which has start node -> action node -> end node. The job starts running -> runs for 10-15 ...
0 votes · 0 answers · 36 views
AWS EMR - reading multiple "zip" files from S3 bucket returns "Your key is too long"
In my daily job I use EMR to process a large amount of data. This data is stored in CSV files in an S3 bucket. The idea I had was to try to process zipped CSV files instead of plain CSV.
In the Hive app I use ...
0 votes · 0 answers · 27 views
Airflow error while creating EMR cluster via DAG
I am looking to create an EMR cluster via airflow DAG using EmrCreateJobFlowOperator using a role called dev-emr-ec2-profile-role for jobFlow. This role is used to provision EMR cluster via Terraform ...
5 votes · 0 answers · 100 views
Spark Repartition/shuffle optimization
I am trying to repartition before applying any transformation logic. This takes a lot of time. Here are the code and a snapshot of the UI below. Can any optimization be applied here?
Cluster: AWS EMR,200 Task ...
1 vote · 0 answers · 108 views
Spark EMR Shuffle Read Fetch Wait Time is 4 hrs
One of my spark jobs failed due to emr-spark-shuffle-fetchfailedexception-with-65tb-data-with-aqe-enabled and has a high Shuffle Read Fetch Wait Time. Is there any way it can be improved?
Spark-submit
spark-...
0 votes · 0 answers · 89 views
Troubleshooting Kafka Integration with Spark Streaming on Amazon EMR Serverless
Objective:
To set up a streaming job on Amazon EMR Serverless to process weather data from Amazon MSK (Managed Streaming for Apache Kafka) and write the word count results to an S3 bucket.
Steps Taken:...
0 votes · 1 answer · 118 views
EMR-Spark Job creating max 1000 partitions/tasks when AQE is enabled
I always see 1000 tasks/partitions getting created for spark jobs with AQE enabled. If I execute the job for monthly data (4 weeks of weekly data) or for a week of data, the shuffle partitions are the same, which is nothing ...
0 votes · 2 answers · 135 views
What does retry in SparkUI mean?
I have spark executed in two different instances:
spark.sql.adaptive.coalescePartitions.enabled=false
spark.sql.adaptive.coalescePartitions.enabled=true
In the first instance, the stage graphs have ...
0 votes · 0 answers · 35 views
ClassCastException in Spark SQL Incremental Load with DBT
I'm encountering a ClassCastException error when running an incremental load using DBT and Spark SQL. The error message indicates an issue with casting in the Spark execution plan:
org.apache.hive....
1 vote · 1 answer · 83 views
Spark emr jobs: Is the number of tasks defined by AQE (adaptive.enabled)?
I see the number of tasks in the spark job is only 1000 after the initial read, whereas the number of cores available is 9000 (1800 executors * 5 cores each). I have enabled AQE and coalesce shuffle partitions. In ...
0 votes · 1 answer · 277 views
How to enable "Use for Hive table metadata" in "AWS Glue Data Catalog settings" using Terraform?
I am using Terraform to set up Trino cluster managed by Amazon EMR.
Here is my Terraform code:
resource "aws_emr_cluster" "hm_amazon_emr_cluster" {
name ...
0 votes · 0 answers · 106 views
Spark EMR long running transformation job GC is taking more time
I have a transformation job running in spark emr (7.1). This job is compute intensive, as it calculates approx_percentile and other aggregate functions. GC is taking more time. I would like to know ...
0 votes · 0 answers · 63 views
AWS EMR Jupyterhub notebook run fails with error: Session isn't active
I noticed this question is asked a lot, but the solutions proposed in other threads do not work for me.
I have created an EMR cluster on AWS and I am running a quite time intensive code. The notebook ...
2 votes · 1 answer · 133 views
Spark aggregate on multiple columns or a hash
Suppose I want to drop duplicates or perform an aggregation on 3 columns in my Spark dataframe.
Would it be more optimal to do
df = df.withColumn(
"hash_dup",
f.hash(
...
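For intuition on the trade-off in this question, a plain-Python sketch (toy data, not Spark): deduplicating on the columns themselves is exact, while deduplicating on a hash adds collision risk. Spark's f.hash is a 32-bit Murmur3 hash, so at billions of rows collisions are expected and distinct rows can be silently dropped; a wide hash avoids that but buys nothing over comparing the columns directly.

```python
import hashlib

rows = [("a", 1, "x"), ("a", 1, "x"), ("b", 2, "y")]

# Option 1: deduplicate on the tuple of the three columns directly (exact).
by_cols = list(dict.fromkeys(rows))

# Option 2: deduplicate on a hash of the three columns. sha256 is used here
# so the toy example is collision-free; a 32-bit hash like Spark's f.hash
# would not give that guarantee at scale.
def row_hash(r):
    return hashlib.sha256("|".join(map(str, r)).encode()).hexdigest()

seen, by_hash = set(), []
for r in rows:
    h = row_hash(r)
    if h not in seen:
        seen.add(h)
        by_hash.append(r)

print(by_cols)  # -> [('a', 1, 'x'), ('b', 2, 'y')]
print(by_cols == by_hash)  # -> True
```

In Spark terms, `df.dropDuplicates(["c1", "c2", "c3"])` already shuffles on exactly those columns, so adding a hash column rarely helps correctness or performance.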