All Questions
Tagged with amazon-emr scala
238 questions
0
votes
0
answers
45
views
CancellationException in Apache Hudi
I have an Apache Hudi job that reads one day's data from S3 and writes it to a Hudi table with clustering enabled on an event_name column. My job was running fine for one month, but then one day ...
0
votes
1
answer
78
views
Spark-Scala vs PySpark: why is the DAG different?
I am converting a PySpark job to Scala; the jobs execute on EMR. The parameters, data, and code are the same, yet the run time differs and so does the DAG that gets created. Here I ...
0
votes
0
answers
42
views
Spark SQL - performance degradation after adding a new column
My code is in Scala and I'm using Spark SQL syntax to union 3 dataframes.
Currently I am working on adding a new field. It's applicable to only one of the dataframes, so the ...
1
vote
0
answers
283
views
EMR - Spark version upgrade from 2.4.0 to 3.1.2 causes write issues in AWS Open Search (Elasticsearch 6.7)
I'm currently using EMR release version [emr.5.23.0] i.e. Spark 2.4.0 & Spark Elastic Search Connector [elasticsearch-spark-30_2.12-7.12.0.jar] to write data into AWS Open Search (Elasticsearch 6....
0
votes
1
answer
98
views
Multiple CSV Scan in spark structured streaming for same file in s3
I have a Spark 3.4.1 structured streaming job which uses the custom s3-sqs connector by AWS: it reads messages from SQS, takes the path provided in each SQS message, and then reads from S3. Now I need to split ...
2
votes
0
answers
470
views
Spark fails to reconfigure log4j2 when using plug-ins bundled in fat jars
After upgrading our Spark code base from 3.2 to 3.3+ and porting the Log4j 1 code to Log4j 2, we find that Spark no longer lets us reconfigure Log4j programmatically the way we could before.
I use the Maven shade ...
0
votes
1
answer
34
views
How to pass the list elements present within a Map in Scala?
I'm currently learning Scala-Spark, so please bear with me.
I'm trying to apply a function to a Scala dataframe to create a new column, as below:
import org.apache.spark.sql.functions._
import org.apache....
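Since the excerpt is truncated, here is only a sketch of one common pattern for this kind of problem: a UDF that receives the Map column and works on the lists held in its values. Column and function names below are hypothetical.

```scala
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical example: join the list values held inside a Map column
// into a single comma-separated string. Column names are placeholders.
val joinMapValues = udf { m: Map[String, Seq[String]] =>
  m.values.flatten.mkString(",")
}
val out = df.withColumn("all_values", joinMapValues(col("my_map_col")))
```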
1
vote
0
answers
325
views
Not able to write to AWS Glue catalog metastore from spark jobs running on EMR
I am writing a simple Spark job on EMR to create a table stored in the Glue catalog, but it fails to recognize the Glue catalog databases and writes to the Spark default metastore instead.
EMR Configurations:-
...
0
votes
1
answer
1k
views
Got "Class software.amazon.msk.auth.iam.IAMClientCallbackHandler could not be found" in Amazon EMR
I have a Spark application. Here is the code that sinks to Amazon MSK:
val query = df.writeStream
.format("kafka")
.option(
"kafka.bootstrap.servers",
&...
0
votes
1
answer
4k
views
Got "java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.FileSourceOptions$" when spark-submit to Amazon EMR
I have a Spark application. My build.sbt looks like
name := "IngestFromS3ToKafka"
version := "1.0"
scalaVersion := "2.12.17"
resolvers += "confluent" at "...
0
votes
0
answers
513
views
java.io.FileNotFoundException: (Permission denied) while renaming the file within Spark application
I get an exception when trying to rename a file within a Spark application: permission denied on the new file name. The same thing works fine from spark-shell run by the same user. P.S. The path is ...
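For renames inside a Spark application, going through the Hadoop FileSystem API rather than java.io is a common approach, since it uses the cluster's configured filesystem and credentials. A minimal sketch with illustrative paths:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Rename via the Hadoop FileSystem API; returns false (or throws) when the
// destination's parent directory is not writable by the job's user.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val renamed = fs.rename(new Path("/data/old_name.txt"), new Path("/data/new_name.txt"))
```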
0
votes
1
answer
37
views
Conditionally filtering parquets based on parameters provided to function
I have a set of partitioned parquets I'm attempting to read in Spark. To simplify filtering, I've written a wrapper function to optionally allow filtering based on the parquets' partition columns. The ...
0
votes
1
answer
162
views
Submitting Multiple Jobs in Sequence
I'm having some trouble understanding how Spark allows for scheduling of jobs. I have a series of jobs I'd like to run in sequence. From what I've read, I can submit any number of jobs to spark-submit ...
0
votes
1
answer
473
views
Class org.apache.spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX
Class org.apache.spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX (the class where the field validation exists). I get this exception when I try to do field validations on Spark ...
0
votes
0
answers
445
views
Implementing exponential backoff in spark between consecutive attempts
On our AWS EMR Spark cluster we have 100-150 jobs running at a given time. Each cluster task node is allocated 7 of 8 cores. But at some hour in the day the load average touches 10, causing task ...
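Spark exposes retry counts (e.g. spark.task.maxFailures) but, to my knowledge, no built-in exponential backoff between attempts; a common workaround is wrapping the failure-prone call in a generic retry helper. A minimal sketch (attempt counts and delays are placeholders):

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Generic exponential backoff: retry `body` up to `attempts` times,
// doubling the delay after each failure and rethrowing the last error.
@tailrec
def retryWithBackoff[T](attempts: Int, delayMs: Long)(body: => T): T =
  Try(body) match {
    case Success(v)                  => v
    case Failure(_) if attempts > 1 =>
      Thread.sleep(delayMs)
      retryWithBackoff(attempts - 1, delayMs * 2)(body)
    case Failure(e)                  => throw e
  }
```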
0
votes
0
answers
273
views
Issue when running the job in EMR 6.6.0 using Spark 3.2.0
My Spark-Scala job runs fine on emr-5.36.0 with Spark 2.4.8 and Scala 2.11,
but I get the error below when running the same job on emr-6.6.0 with Spark 3.2.0 and Scala 2.12.15:
Exception ...
0
votes
1
answer
32
views
How can I save the file name in the tuples in Scala?
I have a folder containing many text files. I have to read these files into one RDD and keep each file's name together with the words in it.
example :
doc1.txt :
" hello my name sam "
doc2.txt :
"hello ...
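sc.wholeTextFiles is the usual tool for this: it yields one (path, content) pair per file, so the file name can travel with every word. A sketch with illustrative paths:

```scala
// RDD[(fullPath, fileContent)], one pair per file in the folder.
val files = sc.wholeTextFiles("hdfs:///docs/")

// Pair each word with the short name of the file it came from,
// e.g. ("doc1.txt", "hello"), ("doc1.txt", "my"), ...
val wordsByFile = files.flatMap { case (path, content) =>
  val name = path.split("/").last
  content.split("\\s+").filter(_.nonEmpty).map(word => (name, word))
}
```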
1
vote
1
answer
414
views
java.lang.VerifyError: Operand stack overflow
I was trying to connect to gRPC from Spark. It works fine locally, but while testing on AWS EMR (after running sbt assembly) I got a conflict with the Spark package on EMR, so I shaded the libraries which ...
2
votes
0
answers
154
views
Persistence & Serialization of Scala case-classes on a cluster
I am looking for a robust, backwards-compatible, serialization & persistence solution for my case-classes.
General
my project is written in Scala.
It contains various case-classes that need to be ...
0
votes
0
answers
115
views
Failed to Save XGBoost Model in EMR 6.7
I am trying to train an Xgboost model in EMR 6.7 and save the model to S3 (Xgboost version is 1.3.1 and Spark version is 3.2.1). But when saving the model this error occurs: java.lang....
0
votes
0
answers
34
views
RDD collect OOM issues
I am trying to parse the input RDD into a map with broadcast, but I am running into memory issues on EMR. Below is the code:
val data = sparkContext.textFile("inputPath")
val result = sparkContext....
0
votes
1
answer
208
views
java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V - Flink on EMR
I am trying to run a Flink (v 1.13.1) application on EMR (v 5.34.0).
My Flink application uses Scallop (v 4.1.0) to parse the arguments passed.
The Scala version used for the Flink application is 2.12.7.
I ...
0
votes
0
answers
209
views
Failed to load class error due to seemingly fine code
I have been intermittently receiving the "Failed to load class [main class]" error from seemingly working code and would like to understand why.
[ec2-user@ip-xxx localtesting]$ spark-...
0
votes
1
answer
391
views
Why did optimizing my Spark config slow it down?
I'm running Spark 2.2 (legacy codebase constraints) on an AWS EMR Cluster (version 5.8), and run a system of both Spark and Hadoop jobs on a cluster daily. I noticed that the configs that are ...
1
vote
0
answers
1k
views
Scala - java.lang.IllegalArgumentException: requirement failed: All input types must be the same except
Here I'm trying to add a date to the data frame (line 232), but I'm getting an exception on line 234:
232 - val df5Final = df5Phone.withColumn(colName, regexp_replace(col(colName), date_pattern, "...
1
vote
2
answers
449
views
java.lang.VerifyError: Operand stack overflow for google-ads API and SBT
I am trying to migrate from Google AdWords to the google-ads-v10 API in Spark 3.1.1 on EMR.
I am facing some dependency issues due to conflicts with existing jars.
Initially, we were facing a dependency ...
3
votes
0
answers
2k
views
Spark buffer holder size limit issue
I am running an aggregate function at column level, like
df.groupBy("a").agg(collect_set(b))
The column value grows beyond the default size limit of 2 GB.
Error details:
Spark job fails with an ...
0
votes
1
answer
419
views
java.lang.ArrayIndexOutOfBoundsException: 1 while saving data frame in spark Scala
In EMR, we are using a Salesforce Bulk API call to fetch records from a Salesforce object. For one of the objects (TASK), we get the error below while saving its data frame to parquet:
java.lang....
1
vote
0
answers
198
views
AWS EMR - Spark-shell \ Spark submit - Unrecognized VM option
I am getting the exception below while invoking spark-shell on AWS EMR:
Unrecognized VM option 'CMSClassUnloaadingEnabled'
Did you mean '(+/-)CMSClassUnloadingEnabled'?
Error: Could not create the Java ...
0
votes
0
answers
1k
views
Spark scala config for S3 config --request-payer requester
I am writing a Spark job to access an S3 file in parquet format. The bucket owner has enabled the request-payer config. I can access the S3 file using the CLI command below (without request-payer it gives an access ...
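With the S3A connector (fs.s3a, in Hadoop 3.3.1 and later) requester-pays access can be enabled through a Hadoop configuration flag. A sketch under that assumption, with an illustrative bucket path; note that EMR's own EMRFS (plain s3:// scheme) has a separate setting documented by AWS:

```scala
// S3A (Hadoop >= 3.3.1): send the requester-pays header with each request.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.requester.pays.enabled", "true")

val df = spark.read.parquet("s3a://owner-bucket/path/data.parquet")
```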
1
vote
1
answer
455
views
Developing a multi-file Scala package in Spark EMR notebook
I'm basically looking for a way to do Spark-based Scala development on EMR. So I'd have a couple of project files on the Hadoop cluster:
// mypackage.scala
package mypackage
<Spark-dependent scala ...
1
vote
1
answer
3k
views
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
I have this piece of code:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration
new Path("s3://bucket/key").getFileSystem(new Configuration())
When I run it locally ...
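A likely explanation for local failures of this kind: the bare "s3" scheme has no registered implementation outside EMR (EMRFS ships only on EMR nodes), so locally it usually has to be mapped to S3A. A sketch, assuming hadoop-aws and the AWS SDK are on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration()
// Map the "s3" scheme to the S3A implementation for local runs;
// on EMR itself this is unnecessary because EMRFS handles s3://.
conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val fs = new Path("s3://bucket/key").getFileSystem(conf)
```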
1
vote
0
answers
851
views
How to Run Apache Tika on Apache Spark
I am trying to run Apache Tika on Apache Spark on AWS EMR to perform distributed text extraction on a large collection of documents. I have built the Tika JAR with shaded dependencies as explained in ...
0
votes
1
answer
101
views
TensorFlowException: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/yarn/usercache
I am trying to run the "onto_electra_base_uncased" model on some data stored in a Hive table.
I ran count() on the df before saving the data into the Hive table and got this exception.
Spark shell launch ...
0
votes
0
answers
119
views
Glue table behaving differently for EMR than Athena
I have a table set up in Glue, the metastore. It points to a bunch of constantly updating parquet files on S3, inside a bucket with today's date on it. There is a new folder for each day.
If I am to query ...
1
vote
1
answer
2k
views
Spark on AWS EMR: use more executors
Long story short: How can I increase the number of executors in Spark on EMR?
Short story long:
I am running a pure-compute Scala Spark job (a Monte Carlo method to estimate Pi) on EMR 6.3 (Spark 3.1.1)....
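As a starting point (a sketch, not definitive EMR tuning advice): EMR enables dynamic allocation by default, so pinning the executor count usually means disabling it and sizing executors explicitly at submit time. The class name and jar path below are placeholders:

```shell
spark-submit \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 12 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.EstimatePi s3://my-bucket/jobs/pi.jar
```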
0
votes
1
answer
510
views
Spark EMR job jackson error - java.lang.NoSuchMethodError - AnnotatedMember.getType()Lcom/fasterxml/jackson/databind/JavaType
I know similar questions have already been answered here, but for some reason none of the options work for me.
Below is the error I'm getting:
User class threw exception: java.lang....
0
votes
1
answer
885
views
Add a new line in front of each line before writing to JSON format using Spark in Scala
I'd like to add one newline in front of each of my JSON documents before Spark writes them into my S3 bucket:
df.createOrReplaceTempView("ParquetTable")
val parkSQL = spark.sql("select ...
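One way to get a leading newline per document (a sketch; the output path is illustrative) is to serialize each row with toJSON, prepend the newline, and write the result as plain text:

```scala
import spark.implicits._

// Each row becomes one JSON string; prefix a newline, then write as text.
val withLeadingNewline = df.toJSON.map(json => "\n" + json)
withLeadingNewline.write.text("s3://my-bucket/output/")
```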
-1
votes
1
answer
280
views
How can I add 2 (pyspark,scala) steps together in AWS EMR?
I want to add two steps together in an AWS EMR cluster.
Step 1 is PySpark-based code; step 2 is Scala-Spark-based code.
How do I achieve this?
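Both steps can go into a single add-steps call; each step definition simply points at its own artifact (a PySpark script or a Scala jar), and EMR runs them in order. A sketch where the cluster ID, step names, class name, and S3 paths are all placeholders:

```shell
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=Spark,Name=Step1-PySpark,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/steps/step1.py]' \
  'Type=Spark,Name=Step2-Scala,ActionOnFailure=CONTINUE,Args=[--class,com.example.Step2,s3://my-bucket/steps/step2.jar]'
```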
1
vote
1
answer
840
views
Stream data from flink to S3
I am using Flink on Amazon EMR and want to stream the results of my pipeline to an S3 bucket.
I am using Flink version 1.11.2.
This is a snippet of how the code looks right now:
val ...
1
vote
1
answer
480
views
Spark Kryo deserialization of EMR-produced files fails locally
Upon upgrading EMR version to 6.2.0 (we previously used 5.0 beta - ish) and Spark 3.0.1, we noticed that we were unable to locally read Kryo files written from EMR clusters (this was obviously ...
0
votes
2
answers
190
views
How to run transformation in parallel spark
I am trying to read a text.gz file, repartition it, and do some transformations. However, when I look at the DAG, stage 1 is reading the data and doing the transformation in only 1 task, and it is taking time ...
1
vote
1
answer
734
views
Set maximizeResourceAllocation=true at EMR cluster level programatically in spark Scala
I'm trying to find a way to set the maximizeResourceAllocation=true property at the EMR cluster level from Spark Scala. I used the --conf maximizeResourceAllocation=true argument with spark-submit ...
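As far as the AWS docs describe it, maximizeResourceAllocation is applied through the "spark" configuration classification when the cluster is created, not as a spark-submit --conf, so it is normally set at launch time. A sketch where the instance types, counts, and release label are placeholders:

```shell
aws emr create-cluster \
  --release-label emr-6.6.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --configurations '[{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"}}]' \
  --use-default-roles
```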
0
votes
0
answers
73
views
How to make sure spark job is Utilizing all the available resource (utilizing all the containers)
I have a Spark streaming job that reads data from Kafka and puts it into a data warehouse.
It runs perfectly fine, but a point of concern: when I monitored the logs, I noticed that logs from my ...
1
vote
0
answers
453
views
Spark with AWS Elastic Search integration - Getting - Cloud instance without the proper setting 'es.nodes.wan.only' error
I created an AWS Elasticsearch instance with public access and tried to push a dataframe into ES from my local machine; I am getting the exception below:
Exception in thread "main" org.elasticsearch.hadoop....
0
votes
0
answers
107
views
Where is output file created after running FileWriter in AWS EMR
This is how I am writing to a file (Scala code):
import java.io.FileWriter
val fw = new FileWriter("my_output_filename.txt", true)
fw.write("something to write into output file")
fw....
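A plain java.io.FileWriter writes to the local filesystem of whichever JVM runs it: the driver's current working directory in client mode, or a transient YARN container directory in cluster mode, never HDFS or S3. Using an absolute path makes the location predictable; the path below is illustrative:

```scala
import java.io.FileWriter

// Writes to the local disk of the JVM executing this code (the driver in
// client mode; a YARN container dir, cleaned up afterwards, in cluster mode).
val fw = new FileWriter("/tmp/my_output_filename.txt", true)
try fw.write("something to write into output file\n")
finally fw.close()
```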
1
vote
1
answer
477
views
Running AWS EMR Spark: java.io.FileNotFoundException ... io.netty_netty-transport-native-epoll-4.1.59.Final.jar does not exist
When trying to run a Spark 3.0.1 job on AWS emr-6.1.0, the following error occurs:
Exception in thread "main" java.io.FileNotFoundException: File file:/home/hadoop/.ivy2/jars/io.netty_netty-...
1
vote
0
answers
729
views
How to fix java.lang.NoSuchMethodError on AWS EMR/Spark?
I am running a Spark project on AWS/EMR which includes the following dependency:
"com.google.apis" % "google-api-services-analyticsreporting" % "v4-rev20210106-1.31.0"
&...
0
votes
1
answer
1k
views
How to use Spark-Submit to run a scala file present on EMR cluster's master node?
So, I connect to my EMR cluster's master node using SSH. This is the file structure present in the master node:
|-- AnalysisRunner.scala
|-- AutomatedConstraints.scala
|-- deequ-1.0.1.jar
|-- new
| |...
0
votes
2
answers
219
views
spark scala 'take(10)' operation is taking too long
I got the following code within my super-simple Spark Scala application:
...
val t3 = System.currentTimeMillis
println("VertexRDD created in " + (t3 - t2) + " ms")
...
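Worth remembering when timing this: Spark transformations are lazy, so the first action (take, count, and so on) pays for the entire upstream lineage, which can make take(10) look slow. A sketch of isolating the cost, assuming a GraphX-style vertices RDD as the excerpt suggests:

```scala
// Materialize (and cache) the expensive lineage once, then time the action.
val vertices = graph.vertices        // lazy: nothing computed yet
vertices.cache()
vertices.count()                     // forces the full computation once

val t0 = System.currentTimeMillis
vertices.take(10)                    // now cheap: served from the cache
println("take(10) in " + (System.currentTimeMillis - t0) + " ms")
```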