
All Questions

0 votes
0 answers
45 views

CancellationException in Apache Hudi

I have an Apache Hudi job that reads one day of data from S3 and writes it to a Hudi table with clustering enabled on an event_name column. My job had been running fine for a month, but then one day ...
Ishan Sanganeria's user avatar
0 votes
1 answer
78 views

Spark-Scala vs PySpark: why is the DAG different?

I am converting a PySpark job to Scala, and the jobs execute on EMR. The parameters, data, and code are the same. However, I see that the run time is different, and so is the DAG that gets created. Here I ...
user3858193's user avatar
  • 1,518
0 votes
0 answers
42 views

Spark SQL - performance degradation after adding a new column

My code is in Scala and I'm using Spark SQL syntax to union 3 dataframes. Currently I am working on adding a new field. It's applicable to only one of the dataframes, so the ...
nelyanne's user avatar
  • 106
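A common way to keep such a union valid is to add the new field to the frames that lack it as a typed null literal; a minimal sketch, assuming hypothetical frames df1, df2, df3 and a string field new_field:

    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.StringType

    // df1, df2, df3 and "new_field" are hypothetical; only df1 actually carries the new field.
    val df2Aligned = df2.withColumn("new_field", lit(null).cast(StringType))
    val df3Aligned = df3.withColumn("new_field", lit(null).cast(StringType))

    // unionByName keeps the union robust to column-ordering differences.
    val combined = df1.unionByName(df2Aligned).unionByName(df3Aligned)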
1 vote
0 answers
283 views

EMR - Spark version upgrade from 2.4.0 to 3.1.2 causes write issues in AWS Open Search (Elasticsearch 6.7)

I'm currently using EMR release version [emr.5.23.0] i.e. Spark 2.4.0 & Spark Elastic Search Connector [elasticsearch-spark-30_2.12-7.12.0.jar] to write data into AWS Open Search (Elasticsearch 6....
Robin's user avatar
  • 129
0 votes
1 answer
98 views

Multiple CSV scans in Spark structured streaming for the same file in S3

I have a Spark 3.4.1 structured streaming job that uses the custom s3-sqs connector from AWS, which reads messages from SQS, takes the path provided in the SQS message, and then reads from S3. Now I need to split ...
Umang Desai's user avatar
2 votes
0 answers
470 views

Spark fails to reconfigure log4j2 when using plug-ins bundled in fat jars

After upgrading our Spark code base from 3.2 to 3.3+ and porting from Log4j 1 to Log4j 2, we find that Spark doesn't let us reconfigure Log4j programmatically like we could before. I use Maven shade ...
jeanfrancis's user avatar
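For reference, a minimal sketch of reconfiguring Log4j 2 programmatically via its Configurator (the config path is hypothetical); when plug-ins are shaded into a fat jar, the Log4j2Plugins.dat plugin caches typically also have to be merged during shading, or custom plug-ins are not discovered at all:

    import java.net.URI
    import org.apache.logging.log4j.core.config.Configurator

    // Re-point Log4j 2 at an alternative configuration at runtime (path is hypothetical).
    Configurator.reconfigure(new URI("file:///etc/spark/conf/log4j2-custom.xml"))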
0 votes
1 answer
34 views

How to pass the list elements present within a map in Scala?

I'm currently learning Scala/Spark, so please bear with me. I'm trying to apply a function to a Scala DataFrame to create a new column as below - import org.apache.spark.sql.functions._ import org.apache....
djm's user avatar
  • 45
1 vote
0 answers
325 views

Not able to write to AWS Glue catalog metastore from spark jobs running on EMR

I am writing a simple Spark job running on EMR to create a table stored in the Glue catalog, but it fails to recognize the Glue catalog databases and writes to the Spark default metastore instead. EMR Configurations:- ...
karthik's user avatar
  • 36
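A minimal sketch of the usual setup, assuming the Glue Data Catalog client is on the EMR classpath (the same factory class can instead be set through the EMR hive-site / spark-hive-site classifications):

    import org.apache.spark.sql.SparkSession

    // Enable Hive support and point the Hive client at the Glue Data Catalog factory.
    val spark = SparkSession.builder()
      .appName("glue-catalog-example")
      .config("hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()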
0 votes
1 answer
1k views

Got "Class software.amazon.msk.auth.iam.IAMClientCallbackHandler could not be found" in Amazon EMR

I have a Spark application. Here is the code that sinks to Amazon MSK: val query = df.writeStream .format("kafka") .option("kafka.bootstrap.servers", ...
Hongbo Miao's user avatar
  • 49.5k
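A sketch of the sink options the MSK IAM auth library expects, assuming the aws-msk-iam-auth jar is also shipped to the executors (e.g. via --packages software.amazon.msk:aws-msk-iam-auth); brokers, topic, and checkpoint path are placeholders:

    // Hypothetical placeholders; df is the streaming DataFrame from the question.
    val brokers = "b-1.my-cluster.kafka.us-east-1.amazonaws.com:9098"
    val checkpointDir = "s3://my-bucket/checkpoints/msk-sink/"

    val query = df.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "AWS_MSK_IAM")
      .option("kafka.sasl.jaas.config",
        "software.amazon.msk.auth.iam.IAMLoginModule required;")
      .option("kafka.sasl.client.callback.handler.class",
        "software.amazon.msk.auth.iam.IAMClientCallbackHandler")
      .option("topic", "my-topic")
      .option("checkpointLocation", checkpointDir)
      .start()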
0 votes
1 answer
4k views

Got "java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.FileSourceOptions$" when spark-submit to Amazon EMR

I have a Spark application. My build.sbt looks like name := "IngestFromS3ToKafka" version := "1.0" scalaVersion := "2.12.17" resolvers += "confluent" at "...
Hongbo Miao's user avatar
  • 49.5k
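That class only exists in newer Spark releases, so the usual cause is compiling against a Spark newer than the one the EMR release ships; a build.sbt sketch, with the version below as an assumption to be matched to the cluster:

    // build.sbt sketch: pin the Spark artifacts to the cluster's version and mark
    // them "provided" so the fat jar does not drag in a mismatched Spark.
    val sparkVersion = "3.3.1" // assumption: match your EMR release's Spark version

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
    )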
0 votes
0 answers
513 views

java.io.FileNotFoundException: (Permission denied) while renaming the file within Spark application

I am getting an exception when trying to rename a file within a Spark application: permission denied on the new file name. The same thing works fine in spark-shell run by the same user. P.S. The path is ...
0 votes
1 answer
37 views

Conditionally filtering parquets based on parameters provided to function

I have a set of partitioned parquets I'm attempting to read in Spark. To simplify filtering, I've written a wrapper function to optionally allow filtering based on the parquets' partition columns. The ...
maxwellray's user avatar
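A minimal sketch of such a wrapper, assuming hypothetical partition columns year and region; each filter is applied only when its parameter is supplied, so partition pruning still kicks in for the filters that are present:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col

    def readParquet(spark: SparkSession,
                    path: String,
                    year: Option[Int] = None,
                    region: Option[String] = None): DataFrame = {
      val base = spark.read.parquet(path)
      // Keep only the filters whose parameter was actually provided.
      val filters = Seq(
        year.map(y => col("year") === y),
        region.map(r => col("region") === r)
      ).flatten
      filters.foldLeft(base)((df, cond) => df.filter(cond))
    }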
0 votes
1 answer
162 views

Submitting Multiple Jobs in Sequence

I'm having some trouble understanding how Spark allows for scheduling of jobs. I have a series of jobs I'd like to run in sequence. From what I've read, I can submit any number of jobs to spark-submit ...
maxwellray's user avatar
0 votes
1 answer
473 views

Class org.apache-spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX

Class org.apache-spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX(class where field validation exists) Exception when I am trying to do field validations on Spark ...
Dheeraj Garikapati's user avatar
0 votes
0 answers
445 views

Implementing exponential backoff in spark between consecutive attempts

On our AWS EMR Spark cluster we have 100-150 jobs running at any given time. Each cluster task node is allocated 7 of 8 cores. But at some hours of the day the load average touches 10, causing task ...
chendu's user avatar
  • 759
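Spark itself only exposes fixed retry counts (e.g. spark.task.maxFailures, spark.yarn.maxAppAttempts), so backoff between consecutive attempts usually has to be implemented around the action; a generic Scala sketch:

    import scala.annotation.tailrec
    import scala.util.{Failure, Success, Try}

    // Retry an arbitrary action (e.g. a write stage) with exponential backoff.
    @tailrec
    def withBackoff[T](action: => T, retriesLeft: Int, delayMs: Long): T =
      Try(action) match {
        case Success(result) => result
        case Failure(_) if retriesLeft > 0 =>
          Thread.sleep(delayMs)
          withBackoff(action, retriesLeft - 1, delayMs * 2) // double the wait each attempt
        case Failure(e) => throw e
      }

    // Usage sketch (runMyJob() is hypothetical): retry up to 5 times, starting at 30 s.
    // withBackoff(runMyJob(), retriesLeft = 5, delayMs = 30000)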
0 votes
0 answers
273 views

Issue when running the job in EMR 6.6.0 using Spark 3.2.0

My job (Spark-Scala) runs fine on emr-5.36.0 using Spark 2.4.8 and Scala 2.11, but I am getting the error below when running the same job on emr-6.6.0 using Spark 3.2.0 and Scala 2.12.15: Exception ...
Ankit Gupta's user avatar
0 votes
1 answer
32 views

How can I save the file name in the tuples in Scala?

I have a folder that contains many text files. I have to read these files into one RDD and keep the file name with the words in it, for example: doc1.txt: "hello my name sam" doc2.txt: "hello ...
Sam Sam's user avatar
  • 41
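A minimal sketch using wholeTextFiles, which yields (filePath, fileContent) pairs so the file name can be carried alongside every word (the S3 path is a placeholder):

    // Read each file as a whole, keeping its path.
    val docs = spark.sparkContext.wholeTextFiles("s3://my-bucket/docs/*.txt")

    val fileWords = docs.flatMap { case (path, content) =>
      val fileName = path.split("/").last
      content.split("\\s+").filter(_.nonEmpty).map(word => (fileName, word))
    }
    // e.g. ("doc1.txt", "hello"), ("doc1.txt", "my"), ...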
1 vote
1 answer
414 views

java.lang.VerifyError: Operand stack overflow

I was trying to connect to gRPC from Spark. It works fine locally, but while testing it in AWS EMR (after doing sbt assembly) I got a conflict with the Spark packages in EMR, so I shaded the libraries which ...
Sudipta Kar's user avatar
2 votes
0 answers
154 views

Persistence & Serialization of Scala case-classes on a cluster

I am looking for a robust, backwards-compatible serialization & persistence solution for my case classes. In general, my project is written in Scala. It contains various case classes that need to be ...
OrenKov's user avatar
  • 23
0 votes
0 answers
115 views

Failed to Save XGBoost Model in EMR 6.7

I am trying to train an Xgboost model in EMR 6.7 and save the model to S3 (Xgboost version is 1.3.1 and Spark version is 3.2.1). But when saving the model this error occurs: java.lang....
canye's user avatar
  • 1
0 votes
0 answers
34 views

RDD collect OOM issues

I am trying to parse the input RDD into a map with broadcast, but I am running into memory issues on EMR. Below is the code: val data = sparkContext.textFile("inputPath") val result = sparkContext....
user0712's user avatar
0 votes
1 answer
208 views

java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V - Flink on EMR

I am trying to run a Flink (v 1.13.1) application on EMR ( v 5.34.0). My Flink application uses Scallop(v 4.1.0) to parse the arguments passed. Scala version used for Flink application is 2.12.7. I ...
Amit's user avatar
  • 1,121
0 votes
0 answers
209 views

Failed to load class error due to seemingly fine code

I have been intermittently receiving the "Failed to load class [main class]" error from seemingly working code and would like to understand why. [ec2-user@ip-xxx localtesting]$ spark-...
user9853453's user avatar
0 votes
1 answer
391 views

Why did optimizing my Spark config slow it down?

I'm running Spark 2.2 (legacy codebase constraints) on an AWS EMR Cluster (version 5.8), and run a system of both Spark and Hadoop jobs on a cluster daily. I noticed that the configs that are ...
NateH06's user avatar
  • 3,564
1 vote
0 answers
1k views

Scala - java.lang.IllegalArgumentException: requirement failed: All input types must be the same except

Here I'm trying to add a date to the data frame (line 232), but I'm getting an exception on line 234: 232 - val df5Final = df5Phone.withColumn(colName, regexp_replace(col(colName), date_pattern, "...
mohit's user avatar
  • 13
1 vote
2 answers
449 views

java.lang.VerifyError: Operand stack overflow for google-ads API and SBT

I am trying to migrate from Google-AdWords to google-ads-v10 API in spark 3.1.1 in EMR. I am facing some dependency issues due to conflicts with existing jars. Initially, we were facing a dependency ...
blueJupiter16's user avatar
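A build.sbt sketch of relocating the conflicting packages with sbt-assembly 1.x shade rules; the package patterns below are examples, not a definitive list for google-ads:

    // Requires the sbt-assembly plugin; rename the clashing packages inside the fat jar
    // so the application classes do not collide with the versions EMR already ships.
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll,
      ShadeRule.rename("com.google.common.**"   -> "shaded.com.google.common.@1").inAll
    )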
3 votes
0 answers
2k views

Spark buffer holder size limit issue

I am doing a column-level aggregate like df.groupby("a").agg(collect_set(b)). The aggregated column value grows beyond the default size limit of 2 GB. Error details: Spark job fails with an ...
lsha's user avatar
  • 31
0 votes
1 answer
419 views

java.lang.ArrayIndexOutOfBoundsException: 1 while saving data frame in spark Scala

In EMR, we are using the Salesforce Bulk API to fetch records from a Salesforce object. For one of the objects (TASK), the data frame fails while saving to parquet with the error below. java.lang....
Kalpana's user avatar
1 vote
0 answers
198 views

AWS EMR - Spark-shell \ Spark submit - Unrecognized VM option

I am getting the exception below while invoking spark-shell from AWS EMR: Unrecognized VM option 'CMSClassUnloaadingEnabled' Did you mean '(+/-)CMSClassUnloadingEnabled'? Error: Could not create the Java ...
Learn Hadoop's user avatar
  • 3,040
0 votes
0 answers
1k views

Spark scala config for S3 config --request-payer requester

I am writing a Spark job to access an S3 file in parquet format. The bucket owner has enabled the request-payer config. I can access the S3 file using the CLI command below (without request-payer it gives an access ...
MIK's user avatar
  • 402
1 vote
1 answer
455 views

Developing a multi-file Scala package in Spark EMR notebook

I'm basically looking for a way to do Spark-based scala development in EMR. So I'd have a couple of project files on the hadoop cluster: // mypackage.scala package mypackage <Spark-dependent scala ...
Jeff's user avatar
  • 70
1 vote
1 answer
3k views

org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"

I have this piece of code: import org.apache.hadoop.fs.Path import org.apache.hadoop.conf.Configuration new Path("s3://bucket/key").getFileSystem(new Configuration()) When I run it locally ...
Prassi's user avatar
  • 107
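Outside EMR only the s3a implementation from hadoop-aws is available, so a common local workaround is to map the s3 scheme onto it; a sketch, assuming hadoop-aws and the matching AWS SDK are on the classpath:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // EMR ships its own "s3" filesystem (EMRFS); locally, point the s3 scheme at S3A instead.
    val conf = new Configuration()
    conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    val fs = new Path("s3://bucket/key").getFileSystem(conf)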
1 vote
0 answers
851 views

How to Run Apache Tika on Apache Spark

I am trying to run Apache Tika on Apache Spark on AWS EMR to perform distributed text extraction on a large collection of documents. I have built the Tika JAR with shaded dependencies as explained in ...
mkreisel's user avatar
  • 205
0 votes
1 answer
101 views

TensorFlowException: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/yarn/usercache

I am trying to run the "onto_electra_base_uncased" model on some data stored in a Hive table. I ran count() on the df before saving the data into the Hive table and got this exception. Spark Shell launch ...
mohit's user avatar
  • 13
0 votes
0 answers
119 views

Glue table behaving differently for EMR than Athena

I have a table set up in the Glue metastore. It points to a bunch of constantly updating parquet files on S3, inside a bucket with today's date on it. There is a new folder for each day. If I query ...
JetS79's user avatar
  • 71
1 vote
1 answer
2k views

Spark on AWS EMR: use more executors

Long story short: How can I increase the number of executors in Spark on EMR? Short story long: I am running a pure compute scala spark job (Monte Carlo method to estimate Pi) on EMR 6.3 (Spark 3.1.1)....
Guillaume's user avatar
  • 2,819
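A sketch of sizing executors explicitly instead of relying on EMR defaults; the numbers are placeholders to be derived from the instance type, and the same settings are often passed as spark-submit --conf instead:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("monte-carlo-pi")
      .config("spark.executor.instances", "10")        // placeholder counts and sizes
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "8g")
      .config("spark.dynamicAllocation.enabled", "false") // otherwise instances is ignored
      .getOrCreate()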
0 votes
1 answer
510 views

Spark EMR job jackson error - java.lang.NoSuchMethodError - AnnotatedMember.getType()Lcom/fasterxml/jackson/databind/JavaType

I know similar questions have already been answered here, but for some reason none of the options are working for me. Below is the error I'm getting: User class threw exception: java.lang....
Rajashree Gr's user avatar
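A build.sbt sketch that forces a single Jackson version across the fat jar, since that NoSuchMethodError is a classic symptom of mixed jackson-core/jackson-databind versions; the version shown is an assumption to be aligned with the Spark release in use:

    val jacksonVersion = "2.12.3" // assumption: match the Jackson your Spark bundles

    dependencyOverrides ++= Seq(
      "com.fasterxml.jackson.core"   %  "jackson-core"         % jacksonVersion,
      "com.fasterxml.jackson.core"   %  "jackson-databind"     % jacksonVersion,
      "com.fasterxml.jackson.core"   %  "jackson-annotations"  % jacksonVersion,
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % jacksonVersion
    )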
0 votes
1 answer
885 views

Add a new line in front of each line before writing to JSON format using Spark in Scala

I'd like to add one new line in front of each of my JSON documents before Spark writes them into my S3 bucket: df.createOrReplaceTempView("ParquetTable") val parkSQL = spark.sql("select ...
Fisher Coder's user avatar
  • 3,566
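A minimal sketch, reusing the existing spark session and df from the question: serialize each row with toJSON, prefix the newline, and write the strings as plain text rather than through the json writer (the destination path is a placeholder):

    import spark.implicits._

    // toJSON yields a Dataset[String], one JSON document per row.
    val withLeadingNewline = df.toJSON.map(json => "\n" + json)
    withLeadingNewline.write.text("s3://my-bucket/output/")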
-1 votes
1 answer
280 views

How can I add 2 (pyspark,scala) steps together in AWS EMR?

I want to add two steps together in an AWS EMR cluster. Step 1 is PySpark-based code, step 2 is Scala-Spark-based code. How do I achieve this?
user14885011's user avatar
1 vote
1 answer
840 views

Stream data from flink to S3

I am using Flink on Amazon EMR and want to stream the results of my pipeline to an S3 bucket. I am using Flink version 1.11.2. This is a code snippet of how the code looks right now: val ...
Tom Adedeji's user avatar
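A sketch for Flink 1.11 using a row-format StreamingFileSink (the S3 path and the result stream are placeholders); the flink-s3 filesystem plugin must be available on EMR and checkpointing enabled so part files are finalized:

    import org.apache.flink.api.common.serialization.SimpleStringEncoder
    import org.apache.flink.core.fs.Path
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink

    // Row-format sink writing plain strings to S3; the bucket path is hypothetical.
    val s3Sink: StreamingFileSink[String] = StreamingFileSink
      .forRowFormat(new Path("s3://my-bucket/flink-output/"),
                    new SimpleStringEncoder[String]("UTF-8"))
      .build()

    // resultStream is the (hypothetical) DataStream[String] produced by the pipeline.
    // resultStream.addSink(s3Sink)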
1 vote
1 answer
480 views

Spark Kryo deserialization of EMR-produced files fails locally

Upon upgrading EMR version to 6.2.0 (we previously used 5.0 beta - ish) and Spark 3.0.1, we noticed that we were unable to locally read Kryo files written from EMR clusters (this was obviously ...
Gilad Ber's user avatar
  • 126
0 votes
2 answers
190 views

How to run a transformation in parallel in Spark

I am trying to read a text.gz file, repartition it, and do some transformations; however, when I look at the DAG, stage 1 reads the data and does the transformation in only 1 task, and it is taking time ...
Deepak Poojari's user avatar
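A gzip file is not splittable, so the read stage is always a single task; repartitioning right after the read shuffles the data so the downstream transformations run in parallel. A sketch, with the path, partition count, and transformation as placeholders:

    // Single-task read (gzip is not splittable), then fan out for the heavy work.
    val raw = spark.sparkContext.textFile("s3://my-bucket/input/text.gz")
    val repartitioned = raw.repartition(200)                // pick per cluster size
    val transformed = repartitioned.map(line => line.toUpperCase) // placeholder transformation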
1 vote
1 answer
734 views

Set maximizeResourceAllocation=true at EMR cluster level programmatically in Spark Scala

I'm trying to find a way to set the maximizeResourceAllocation=true property at the EMR cluster level in spark scala. I used --conf maximizeResourceAllocation=true argument with the spark-submit ...
Rajashree Gr's user avatar
0 votes
0 answers
73 views

How to make sure a Spark job is utilizing all the available resources (all the containers)

I have a Spark streaming job that reads data from Kafka and puts it into a data warehouse. It runs perfectly fine, but my point of concern is that when I monitored the logs, I noticed that logs from my ...
mt_leo's user avatar
  • 67
1 vote
0 answers
453 views

Spark with AWS Elastic Search integration - Getting - Cloud instance without the proper setting 'es.nodes.wan.only' error

I created an AWS Elasticsearch domain with public access and tried to push a dataframe into ES from my local machine, and I am getting the exception below: Exception in thread "main" org.elasticsearch.hadoop....
Learn Hadoop's user avatar
  • 3,040
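A sketch of the elasticsearch-hadoop options typically needed when the endpoint is a managed "cloud" instance reached over its public hostname (es.nodes.wan.only stops the connector from trying to reach data nodes directly); the endpoint and index name are placeholders:

    df.write
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "my-domain.es.amazonaws.com")  // hypothetical endpoint
      .option("es.port", "443")
      .option("es.net.ssl", "true")
      .option("es.nodes.wan.only", "true")
      .mode("append")
      .save("my-index")                                  // hypothetical index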
0 votes
0 answers
107 views

Where is output file created after running FileWriter in AWS EMR

This is how I am writing to file. (Scala code) import java.io.FileWriter val fw = new FileWriter("my_output_filename.txt", true) fw.write("something to write into output file") fw....
Sai Charan Nivarthi's user avatar
1 vote
1 answer
477 views

Running AWS EMR Spark: java.io.FileNotFoundException ... io.netty_netty-transport-native-epoll-4.1.59.Final.jar does not exist

When trying to run a Spark 3.0.1 job on AWS emr-6.1.0, the following error occurs: Exception in thread "main" java.io.FileNotFoundException: File file:/home/hadoop/.ivy2/jars/io.netty_netty-...
Jeremy's user avatar
  • 1,914
1 vote
0 answers
729 views

How to fix java.lang.NoSuchMethodError on AWS EMR/Spark?

I am running a Spark project on AWS/EMR which includes the following dependency: "com.google.apis" % "google-api-services-analyticsreporting" % "v4-rev20210106-1.31.0" ...
Randomize's user avatar
  • 9,093
0 votes
1 answer
1k views

How to use Spark-Submit to run a scala file present on EMR cluster's master node?

So, I connect to my EMR cluster's master node using SSH. This is the file structure present in the master node: |-- AnalysisRunner.scala |-- AutomatedConstraints.scala |-- deequ-1.0.1.jar |-- new | |...
Debapratim Chakraborty's user avatar
0 votes
2 answers
219 views

spark scala 'take(10)' operation is taking too long

I got the following code within my super-simple Spark Scala application: ... val t3 = System.currentTimeMillis println("VertexRDD created in " + (t3 - t2) + " ms") ...
pnet_fabric's user avatar
