Skip to main content
Filter by
Sorted by
Tagged with
0 votes
0 answers
22 views

EMR: Pyspark on AWS graviton does not work

The same pyspark code works on r7a but not r7g or r8g on a EMR cluster (7.5). I build the python environment with conda, and use it in pyspark: conda create -n pyspark python=3.9 --show-channel-urls --...
Guillaume's user avatar
  • 2,819
0 votes
1 answer
23 views

apache-beam installation issue on AWS EMR-EC2 cluster

I started an AWS EMR-EC2 cluster, I am having trouble getting the sparkrunner of apache-beam to work. I have a python script that will use apache-beam. I have tried either aws emr add-steps or ssh ...
Shiyi Yin's user avatar
0 votes
0 answers
45 views

Config params are not propagated when using Spark Connect

I am trying to get Spark Connect working on Amazon EMR (Spark v3.5.1). I started the Connect server on EMR primary node, making sure the JARs required for S3 auth are present in the Classpath: /usr/...
Ninad's user avatar
  • 71
-1 votes
2 answers
51 views

Jackson Databind Conflicts in Apache Spark Project Using Maven Shade Plugin

I am working on a project that processes IMDb data using Apache Spark. My setup involves Spark Core and Spark SQL dependencies, along with Jackson for handling JSON serialization and deserialization. ...
prashantjerk's user avatar
0 votes
0 answers
30 views

OSCAR EMR API: Issues with AppointmentHistory Endpoint (REST and SOAP)

I’m working on integrating with OSCAR EMR, an open-source electronic medical record (EMR) system. My goal is to fetch appointment data (ideally for a specific patient) to validate and manage bookings ...
John Parker's user avatar
0 votes
1 answer
34 views

Pyspark error: " Class org.apache.hadoop.fs.s3a.S3AFileSystem not found" in EMR 7.0.0

I am using EMR 7.0.0 version, which has python 3.9, spark 3.5.0, Hadoop 3.3.6 in AWS. I got the error: File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/...
TripleH's user avatar
  • 479
0 votes
1 answer
28 views

Jobs failed in airflow, despite Spark History UI jobs stuck in running. AWS Serverless

Has anyone ever experienced jobs failing in Airflow despite in Spark History UI, the jobs are still stuck in running. Also, I after I added a line of code to write the data to S3 (without reading it ...
laggyPC's user avatar
  • 19
0 votes
0 answers
74 views

trino large query is failing due to hive metastore failure

I have a large insert into query, which is reaching 99-100% quite fast (after 5-6 min), then completion percent starting to decrease and back to high number and repeating.. This query is reading from ...
roh's user avatar
  • 133
0 votes
1 answer
41 views

Flink Job Execution Fails with `NoClassDefFoundError` on AWS EMR with Python

I am trying to run a Flink job on an AWS EMR cluster (v7.3.0) using Python 3.9 and Apache Flink with PyFlink. My job reads from an AWS Kinesis stream and prints the stream data to console. However, ...
Mughees Asif's user avatar
0 votes
0 answers
27 views

MSK, EMR, MWAA automation with ansible in AWS

I need to automate creation of MWAA environment, EMR cluster and MSK cluster in AWS as a part of custom environment build for the customer use case. I have ansible tool available for automation ...
saurabh umathe's user avatar
-2 votes
1 answer
61 views

spark-submit using --py-files option could not find path to modules

I am trying to submit a pyspark job in EMR cluster. The code for job lies in a zipped package that is placed in S3 : /bin/spark-submit \ --py-files s3://my-dev/scripts/job-launchers/dev/pipeline.zip ...
Smruti Prakash Mohanty's user avatar
1 vote
0 answers
49 views

Spark EMR executor container failing due to Java heap space

One of my Spark code is failing due to executor container failing due to "java.lang.OutOfMemoryError: Java heap space". Any recommendation is appreciated. I am using emr 200 -r7g16xlarge ...
user3858193's user avatar
  • 1,518
1 vote
0 answers
44 views

Optimizing PySpark Feature Engineering with Over a Billion Rows on EMR

I’m working with a large transaction dataset (~1 billion rows) in PySpark on AWS EMR. My goal is to perform feature engineering where I compute statistics like sum, mean, standard deviation, and ...
Meriiiiii's user avatar
0 votes
0 answers
45 views

CancellationException in Apache Hudi

I have an Apache Hudi job that reads data for one day from S3 and writes it to a Hudi table with clustering enabled on a event_name column. My job was running fine for one month, but then one day ...
Ishan Sanganeria's user avatar
0 votes
0 answers
58 views

Spark Job failed in EMR with exit code 137

The spark job runs in emr with 7.2 version. It's failing with below error. Any advice or tips to debug? Error Job aborted due to stage failure: Task 1518 in stage 14.0 failed 4 times, most recent ...
user3858193's user avatar
  • 1,518
1 vote
0 answers
57 views

Spark Shuffle FetchFailedException for large dataset in emr

My job takes a input data of 400 TB parquet dataset from s3. This job runs with 250 r716x large (each having 64 vCore, 488 GiB memory). The job fails with below error org.apache.spark.shuffle....
user3858193's user avatar
  • 1,518
1 vote
1 answer
64 views

Unable to add logs from my spark jobs to spark eventlogs

I am trying to output logs in spark event logs so that they are accessible in history server. I have tried two approaches Adding my own custom logger that extends Serializable extending org.apache....
Vikas Saxena's user avatar
  • 1,173
0 votes
1 answer
233 views

Where to define "EMR runtime role for cluster" for SageMaker users?

I am encountering this error while attempting to connect to my EMR Serverless Cluster from within the SageMaker Studio Notebook: Select EMR runtime execution role for cluster No available EMR ...
Kermit's user avatar
  • 5,862
0 votes
1 answer
26 views

Same Spark Job executing in Parallel for different days in different instance of AWS EMR have performance issues

While running the spark job (only one instance), it completes in 20-30 mins. However, the same code executes in multiple emr instance in parallel taking more time. Ex: I have 3 instance where each ...
user3858193's user avatar
  • 1,518
1 vote
2 answers
240 views

Amazon EMR 7.2 does not support Ganglia?

When I submit spark job using Amazon EMR 7.2, I got exception: software.amazon.awssdk.services.emr.model.EmrException: Specified application: Ganglia is invalid. (Service: Emr, Status Code: 400, ...
Weever's user avatar
  • 25
0 votes
0 answers
25 views

How to check spark job on Amazon EMR console by job submit with lvy

I can successfully submit spark job through Apache livy to Amazon EMR. But I did not see any steps on Amazon EMR console. Though my spark job was completed and success, I could only check job state ...
Weever's user avatar
  • 25
0 votes
0 answers
20 views

Unexplanined sporadic connection error when getting files from S3 with Spark

Context Between roughly 2024-08-10 20:00 and 2024-08-12 04:00, my half-hourly Scala/Spark EMR job failed numerous times, having not failed once in the several few years. Error in the logs: 24/08/12 02:...
Blair Nangle's user avatar
  • 1,521
0 votes
0 answers
29 views

Alter table rename column on Hudi tables on AWS EMR and AW S3

Using EMR 7.2 [root@ip-172-31-34-128 ~]# spark-sql --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/hudi/hudi-aws-bundle.jar --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer&...
Albert T. Wong's user avatar
0 votes
1 answer
45 views

Spark Shell with S3, Hudi and EMR. org.apache.hadoop.fs.s3a.S3AFileSystem not found

Following https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html spark-shell --jars /usr/lib/hudi/hudi-spark-bundle.jar \ --conf "spark.serializer=org.apache.spark....
Albert T. Wong's user avatar
0 votes
1 answer
30 views

Running Spark Shell with Hudi, S3, EMR. Getting not authorized to perform: glue:GetDatabase on resource

the glue database "default" has already been created. scala> (inputDF.write | .format("hudi") | .options(hudiOptions) | .option(DataSourceWriteOptions....
Albert T. Wong's user avatar
1 vote
1 answer
338 views

AQEShuffleRead in Spark Creating few partitions though advisoryPartitionSizeInBytes and initialPartitionNum is provided

I have added the spark.sql.adaptive.advisoryPartitionSizeInBytes=268435456 and spark.sql.adaptive.enabled=true. However my data size for each partition is more than 256 mb. I see the Dag where the ...
user3858193's user avatar
  • 1,518
0 votes
0 answers
148 views

ImportError when importing boto3 in EMR Serverless 7.2.0

I'm deploying an EMR Serverless application and using venv-pack to create the python environment for that application. The venv-pack zip is created inside a Docker container: FROM --platform=linux/...
Viktor Chekhovoi's user avatar
0 votes
1 answer
42 views

unable to run start_notebook_execution for an EMR Jupyter notebook

Why is my AWS EMR Jupyter notebook saying that my cluster id is null? I've verified that 1) the cluster id value being passed is neither null, None, nor '' (empty string) and 2) confirmed that the ...
mwarrior's user avatar
  • 549
0 votes
0 answers
20 views

Epic Emr hyperdrive configuration

I am working as back-end developer in healthcare organization; we are integrating Epic EMR APIs and we need to check patient updated details in epic hyperdrive application. I have installed but don't ...
NorthHill's user avatar
2 votes
3 answers
157 views

Write out the peak memory utilization of a Pyspark Job on EMR to a file

We run a lot of Pyspark jobs on EMR. The pipeline executed is the same, but the inputs can wildly change the peak memory utilization, and that utilization is growing over time. I would like to ...
TexasDev7062's user avatar
1 vote
0 answers
32 views

Why are identical spark jobs taking longer to execute on emr cluster if they are submitted later?

I have an emr cluster that I am submitting 50 jobs to at the same time (about 3 minutes between the first submission and the last submission). I want all the jobs to run in parallel, and I should see ...
Nene Morales's user avatar
0 votes
1 answer
60 views

Unable to access Public APIs using request library in EMR serverless

Getting below error when I am trying to fetch a API using request library. Traceback (most recent call last): File "/tmp/spark-39775710-130a-4403-9182-c557003f351b/lib.zip/urllib3/connection.py&...
Ashwini Kumar's user avatar
0 votes
0 answers
51 views

spark on EMR error when using `foreachBatch`: "terminated with exception: Error while obtaining a new communication channel"

I use spark on EMR with versions: emr-6.13.0, Spark 3.4.1 i try to run a simple spark streaming job that read from kafka and write to memory-table using foreachBatch and get failure "Error while ...
shayms8's user avatar
  • 771
0 votes
0 answers
63 views

EMR Serverless SparkSession builder error: ClassNotFoundException issues

I am trying to create a job in EMR Studio to run in an EMR Serverless application. It's a relatively basic script to use PySpark to read some Athena tables, do some joins, create an output dataframe ...
si1287's user avatar
  • 1
0 votes
0 answers
36 views

Does spark shuffle/exchange converts compress data to uncompress form?

I have input dataset which is 450gb in s3 parquet compressed format. However during exchange it's showing 10 TB. is there any way to tune it. Tow large table are getting joined and and no other ...
user3858193's user avatar
  • 1,518
0 votes
1 answer
78 views

Spark-Scala vs Pyspark Dag is different?

I am converting pyspark job to Scala and jobs executes in emr. The parameter and data and code is same. However I see the run time is different and so also the dag getting created is different. Here I ...
user3858193's user avatar
  • 1,518
0 votes
0 answers
38 views

Apache oozie JA008 error - job state changed from SUCCEDED to FAILED

I'm running oozie HA 5.2.1 on EMR and I have an issue with this temporary directory. I have a workflow which has start node -> action node -> end node. The job start running -> runs for 10-15 ...
Stefan Ss's user avatar
0 votes
0 answers
36 views

AWS EMR - reading multiple "zip" files from S3 bucket returns Your key is too long

In my daily job I use EMR to process large amount of data. This data are stored in CSV files on S3 bucket. The idea I had was to try to process ziped csv files instead of plain csv. In Hive app I use ...
Vape's user avatar
  • 131
0 votes
0 answers
27 views

Airflow error while creating EMR cluster via DAG

I am looking to create an EMR cluster via airflow DAG using EmrCreateJobFlowOperator using a role called dev-emr-ec2-profile-role for jobFlow. This role is used to provision EMR cluster via Terraform ...
Anngva82's user avatar
5 votes
0 answers
100 views

Spark Repartition/shuffle optimization

I am trying to repartiton before applying any transformation logic. This takes a lot of time. Here is code and snapshot of UI below. Any optimization can be applied here?. Cluster: AWS EMR,200 Task ...
user3858193's user avatar
  • 1,518
1 vote
0 answers
108 views

Spark EMR Shuffle Read Fetch Wait Time is in 4hrs

One of my spark job failed due emr-spark-shuffle-fetchfailedexception-with-65tb-data-with-aqe-enabled has high Shuffle Read Fetch Wait Time. is there any way it can be improved. Spark-submit spark-...
user3858193's user avatar
  • 1,518
0 votes
0 answers
89 views

Troubleshooting Kafka Integration with Spark Streaming on Amazon EMR Serverless

Objective: To set up a streaming job on Amazon EMR Serverless to process weather data from Amazon MSK (Managed Streaming for Apache Kafka) and write the word count results to an S3 bucket. Steps Taken:...
user26129742's user avatar
0 votes
1 answer
118 views

EMR-Spark Job creating max 1000 partitions/task when AQE is enabled

I see always 1000 task/partitions getting created for a spark jobs with AQE enabled. If I execute job for monthly(4 times weekly data) or a week data, the shuffle partitions are same.Whis is nothing ...
user3858193's user avatar
  • 1,518
0 votes
2 answers
135 views

What does retry in SparkUI means?

I have spark executed in two different instances: spark.sql.adaptive.coalescePartitions.enabled=false spark.sql.adaptive.coalescePartitions.enabled=true In the first instance, the stage graphs have ...
user3858193's user avatar
  • 1,518
0 votes
0 answers
35 views

ClassCastException in Spark SQL Incremental Load with DBT

I'm encountering a ClassCastException error when running an incremental load using DBT and Spark SQL. The error message indicates an issue with casting in the Spark execution plan: org.apache.hive....
Raul Zinezi's user avatar
1 vote
1 answer
83 views

Spark emr jobs: Is the number of task defined by AQE (adaptive.enabled)?

I see the number of task in spark job is only 1000 after initial read, where as number of cores available is 9000 (1800 executors*5 core each). I have enabled aqe and coalesce shuffle partition. In ...
user3858193's user avatar
  • 1,518
0 votes
1 answer
277 views

How to enable "Use for Hive table metadata" in "AWS Glue Data Catalog settings" using Terraform?

I am using Terraform to set up Trino cluster managed by Amazon EMR. Here is my Terraform code: resource "aws_emr_cluster" "hm_amazon_emr_cluster" { name ...
Hongbo Miao's user avatar
  • 49.5k
0 votes
0 answers
106 views

Spark EMR long running transformation job GC is taking more time

I have transformation job running in spark emr (7.1). This job is compute intensive as it calculates approx_percentile and other aggregated function. The GC is taking more time. I would like to know ...
user3858193's user avatar
  • 1,518
0 votes
0 answers
63 views

AWS EMR Jupyterhub notebook run fails with error: Session isn't active

I noticed this question is asked a lot but the solution proposed in other threads do not work for me. I have created an EMR cluster on AWS and I am running a quite time intensive code. The notebook ...
Francesco Pegoraro's user avatar
2 votes
1 answer
133 views

Spark aggregate on multiple columns or a hash

Supposed I want to drop duplicates or perform an aggregation on 3 columns in my Spark dataframe. Would it be more optimal to do df = df.withColumn( "hash_dup", f.hash( ...
bmcristi's user avatar
  • 103

1
2 3 4 5
69