All Questions
Tagged with amazon-emr scala
238 questions
0
votes
0
answers
45
views
CancellationException in Apache Hudi
I have an Apache Hudi job that reads one day's data from S3 and writes it to a Hudi table with clustering enabled on an event_name column. My job was running fine for one month, but then one day ...
0
votes
1
answer
78
views
Spark-Scala vs PySpark: why is the DAG different?
I am converting a PySpark job to Scala; the jobs execute on EMR. The parameters, data, and code are the same, yet the run time differs and so does the DAG that gets created. Here I ...
0
votes
0
answers
42
views
Spark SQL - performance degradation after adding a new column
My code is in Scala and I'm using Spark SQL syntax to union 3 dataframes.
Currently I am working on adding a new field. It's applicable to only one of the dataframes, so the ...
1
vote
0
answers
283
views
EMR - Spark version upgrade from 2.4.0 to 3.1.2 causes write issues in AWS Open Search (Elasticsearch 6.7)
I'm currently using EMR release version [emr.5.23.0] i.e. Spark 2.4.0 & Spark Elastic Search Connector [elasticsearch-spark-30_2.12-7.12.0.jar] to write data into AWS Open Search (Elasticsearch 6....
0
votes
1
answer
98
views
Multiple CSV Scan in spark structured streaming for same file in s3
I have a Spark 3.4.1 structured streaming job which uses the custom s3-sqs connector by AWS: it reads messages from SQS, takes the path provided in each SQS message, and then reads from S3. Now I need to split ...
2
votes
0
answers
470
views
Spark fails to reconfigure log4j2 when using plug-ins bundled in fat jars
After upgrading our Spark code base from 3.2 to 3.3+ and porting the Log4j 1 code to Log4j 2, we find that Spark no longer lets us reconfigure Log4j programmatically the way we could before.
I use the Maven shade ...
0
votes
1
answer
34
views
How to pass the list elements present within a Map in Scala?
I'm currently learning Scala-Spark, so please bear with me.
I'm trying to apply a function to a Scala dataframe to create a new column, as below:
import org.apache.spark.sql.functions._
import org.apache....
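Since the excerpt is truncated, here is only a sketch of one common pattern for this kind of problem: a UDF that receives the Map column and works on the lists held in its values. Column and function names below are hypothetical.

```scala
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical example: join the list values held inside a Map column
// into a single comma-separated string. Column names are placeholders.
val joinMapValues = udf { m: Map[String, Seq[String]] =>
  m.values.flatten.mkString(",")
}
val out = df.withColumn("all_values", joinMapValues(col("my_map_col")))
```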
1
vote
0
answers
325
views
Not able to write to AWS Glue catalog metastore from spark jobs running on EMR
I am writing a simple Spark job on EMR to create a table stored in the Glue catalog, but it fails to recognize the Glue catalog databases and writes to the Spark default metastore instead.
EMR Configurations:-
...
0
votes
1
answer
1k
views
Got "Class software.amazon.msk.auth.iam.IAMClientCallbackHandler could not be found" in Amazon EMR
I have a Spark application. Here is the code that sinks to Amazon MSK:
val query = df.writeStream
.format("kafka")
.option(
"kafka.bootstrap.servers",
&...
0
votes
1
answer
4k
views
Got "java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.FileSourceOptions$" when spark-submit to Amazon EMR
I have a Spark application. My build.sbt looks like
name := "IngestFromS3ToKafka"
version := "1.0"
scalaVersion := "2.12.17"
resolvers += "confluent" at "...
0
votes
0
answers
513
views
java.io.FileNotFoundException: (Permission denied) while renaming the file within Spark application
I get an exception when trying to rename a file within a Spark application: permission denied on the new file name. The same thing works fine from spark-shell run by the same user. P.S. The path is ...
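For renames inside a Spark application, going through the Hadoop FileSystem API rather than java.io is a common approach, since it uses the cluster's configured filesystem and credentials. A minimal sketch with illustrative paths:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Rename via the Hadoop FileSystem API; returns false (or throws) when the
// destination's parent directory is not writable by the job's user.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val renamed = fs.rename(new Path("/data/old_name.txt"), new Path("/data/new_name.txt"))
```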
0
votes
1
answer
37
views
Conditionally filtering parquets based on parameters provided to function
I have a set of partitioned parquets I'm attempting to read in Spark. To simplify filtering, I've written a wrapper function to optionally allow filtering based on the parquets' partition columns. The ...
0
votes
1
answer
162
views
Submitting Multiple Jobs in Sequence
I'm having some trouble understanding how Spark allows for scheduling of jobs. I have a series of jobs I'd like to run in sequence. From what I've read, I can submit any number of jobs to spark-submit ...
0
votes
1
answer
473
views
Class org.apache.spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX
Class org.apache.spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX (the class where the field validation exists). I get this exception when I try to do field validations on Spark ...
0
votes
0
answers
445
views
Implementing exponential backoff in spark between consecutive attempts
On our AWS EMR Spark cluster we have 100-150 jobs running at a given time. Each cluster task node is allocated 7 of 8 cores. But at some hour in the day the load average touches 10, causing task ...
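Spark exposes retry counts (e.g. spark.task.maxFailures) but, to my knowledge, no built-in exponential backoff between attempts; a common workaround is wrapping the failure-prone call in a generic retry helper. A minimal sketch (attempt counts and delays are placeholders):

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Generic exponential backoff: retry `body` up to `attempts` times,
// doubling the delay after each failure and rethrowing the last error.
@tailrec
def retryWithBackoff[T](attempts: Int, delayMs: Long)(body: => T): T =
  Try(body) match {
    case Success(v)                  => v
    case Failure(_) if attempts > 1 =>
      Thread.sleep(delayMs)
      retryWithBackoff(attempts - 1, delayMs * 2)(body)
    case Failure(e)                  => throw e
  }
```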
0
votes
0
answers
273
views
Issue when running the job in EMR 6.6.0 using Spark 3.2.0
My Spark-Scala job runs fine on emr-5.36.0 with Spark 2.4.8 and Scala 2.11,
but I get the error below when running the same job on emr-6.6.0 with Spark 3.2.0 and Scala 2.12.15:
Exception ...
0
votes
1
answer
32
views
How can I save the file name in the tuples in Scala?
I have a folder containing many text files. I have to read these files into one RDD and keep each file's name together with the words in it.
example :
doc1.txt :
" hello my name sam "
doc2.txt :
"hello ...
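sc.wholeTextFiles is the usual tool for this: it yields one (path, content) pair per file, so the file name can travel with every word. A sketch with illustrative paths:

```scala
// RDD[(fullPath, fileContent)], one pair per file in the folder.
val files = sc.wholeTextFiles("hdfs:///docs/")

// Pair each word with the short name of the file it came from,
// e.g. ("doc1.txt", "hello"), ("doc1.txt", "my"), ...
val wordsByFile = files.flatMap { case (path, content) =>
  val name = path.split("/").last
  content.split("\\s+").filter(_.nonEmpty).map(word => (name, word))
}
```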
1
vote
1
answer
414
views
java.lang.VerifyError: Operand stack overflow
I was trying to connect to gRPC from Spark. It works fine locally, but while testing on AWS EMR (after running sbt assembly) I got a conflict with the Spark package on EMR, so I shaded the libraries which ...
2
votes
0
answers
154
views
Persistence & Serialization of Scala case-classes on a cluster
I am looking for a robust, backwards-compatible, serialization & persistence solution for my case-classes.
General
my project is written in Scala.
It contains various case-classes that need to be ...
0
votes
0
answers
115
views
Failed to Save XGBoost Model in EMR 6.7
I am trying to train an Xgboost model in EMR 6.7 and save the model to S3 (Xgboost version is 1.3.1 and Spark version is 3.2.1). But when saving the model this error occurs: java.lang....
0
votes
0
answers
34
views
RDD collect OOM issues
I am trying to parse the input RDD into a map with broadcast, but I am running into memory issues on EMR. Below is the code:
val data = sparkContext.textFile("inputPath")
val result = sparkContext....
0
votes
1
answer
208
views
java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V - Flink on EMR
I am trying to run a Flink (v 1.13.1) application on EMR (v 5.34.0).
My Flink application uses Scallop (v 4.1.0) to parse the arguments passed.
The Scala version used for the Flink application is 2.12.7.
I ...
0
votes
0
answers
209
views
Failed to load class error due to seemingly fine code
I have been intermittently receiving the "Failed to load class [main class]" error from seemingly working code and would like to understand why.
[ec2-user@ip-xxx localtesting]$ spark-...
0
votes
1
answer
391
views
Why did optimizing my Spark config slow it down?
I'm running Spark 2.2 (legacy codebase constraints) on an AWS EMR Cluster (version 5.8), and run a system of both Spark and Hadoop jobs on a cluster daily. I noticed that the configs that are ...
1
vote
0
answers
1k
views
Scala - java.lang.IllegalArgumentException: requirement failed: All input types must be the same except
Here I'm trying to add a date to the data frame (line 232), but I'm getting an exception on line 234:
232 - val df5Final = df5Phone.withColumn(colName, regexp_replace(col(colName), date_pattern, "...
1
vote
2
answers
449
views
java.lang.VerifyError: Operand stack overflow for google-ads API and SBT
I am trying to migrate from Google AdWords to the google-ads-v10 API in Spark 3.1.1 on EMR.
I am facing some dependency issues due to conflicts with existing jars.
Initially, we were facing a dependency ...
3
votes
0
answers
2k
views
Spark buffer holder size limit issue
I am running an aggregate function at column level, like
df.groupBy("a").agg(collect_set(b))
The column value grows beyond the default size limit of 2 GB.
Error details:
Spark job fails with an ...
0
votes
1
answer
419
views
java.lang.ArrayIndexOutOfBoundsException: 1 while saving data frame in spark Scala
In EMR, we are using a Salesforce Bulk API call to fetch records from a Salesforce object. For one of the objects (TASK), we get the error below while saving its data frame to parquet:
java.lang....
1
vote
0
answers
198
views
AWS EMR - Spark-shell \ Spark submit - Unrecognized VM option
I am getting the exception below while invoking spark-shell on AWS EMR:
Unrecognized VM option 'CMSClassUnloaadingEnabled'
Did you mean '(+/-)CMSClassUnloadingEnabled'?
Error: Could not create the Java ...
0
votes
0
answers
1k
views
Spark scala config for S3 config --request-payer requester
I am writing a Spark job to access an S3 file in parquet format. The bucket owner has enabled the request-payer config. I can access the S3 file using the CLI command below (without request-payer it gives an access ...
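With the S3A connector (fs.s3a, in Hadoop 3.3.1 and later) requester-pays access can be enabled through a Hadoop configuration flag. A sketch under that assumption, with an illustrative bucket path; note that EMR's own EMRFS (plain s3:// scheme) has a separate setting documented by AWS:

```scala
// S3A (Hadoop >= 3.3.1): send the requester-pays header with each request.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.requester.pays.enabled", "true")

val df = spark.read.parquet("s3a://owner-bucket/path/data.parquet")
```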
1
vote
1
answer
455
views
Developing a multi-file Scala package in Spark EMR notebook
I'm basically looking for a way to do Spark-based Scala development on EMR. So I'd have a couple of project files on the Hadoop cluster:
// mypackage.scala
package mypackage
<Spark-dependent scala ...
1
vote
1
answer
3k
views
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
I have this piece of code:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration
new Path("s3://bucket/key").getFileSystem(new Configuration())
When I run it locally ...
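A likely explanation for local failures of this kind: the bare "s3" scheme has no registered implementation outside EMR (EMRFS ships only on EMR nodes), so locally it usually has to be mapped to S3A. A sketch, assuming hadoop-aws and the AWS SDK are on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration()
// Map the "s3" scheme to the S3A implementation for local runs;
// on EMR itself this is unnecessary because EMRFS handles s3://.
conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val fs = new Path("s3://bucket/key").getFileSystem(conf)
```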
1
vote
0
answers
851
views
How to Run Apache Tika on Apache Spark
I am trying to run Apache Tika on Apache Spark on AWS EMR to perform distributed text extraction on a large collection of documents. I have built the Tika JAR with shaded dependencies as explained in ...
0
votes
1
answer
101
views
TensorFlowException: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/yarn/usercache
I am trying to run the "onto_electra_base_uncased" model on some data stored in a Hive table.
I ran count() on the df before saving the data into the Hive table and got this exception.
Spark shell launch ...
0
votes
0
answers
119
views
Glue table behaving differently for EMR than Athena
I have a table set up in Glue, the metastore. It points to a bunch of constantly updating parquet files on S3, inside a bucket with today's date on it. There is a new folder for each day.
If I am to query ...
1
vote
1
answer
2k
views
Spark on AWS EMR: use more executors
Long story short: How can I increase the number of executors in Spark on EMR?
Short story long:
I am running a pure-compute Scala Spark job (a Monte Carlo method to estimate Pi) on EMR 6.3 (Spark 3.1.1)....
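As a starting point (a sketch, not definitive EMR tuning advice): EMR enables dynamic allocation by default, so pinning the executor count usually means disabling it and sizing executors explicitly at submit time. The class name and jar path below are placeholders:

```shell
spark-submit \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 12 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.EstimatePi s3://my-bucket/jobs/pi.jar
```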
0
votes
1
answer
510
views
Spark EMR job jackson error - java.lang.NoSuchMethodError - AnnotatedMember.getType()Lcom/fasterxml/jackson/databind/JavaType
I know similar questions have already been answered here, but for some reason none of the options work for me.
Below is the error I'm getting:
User class threw exception: java.lang....
0
votes
1
answer
885
views
Add a new line in front of each line before writing to JSON format using Spark in Scala
I'd like to add one newline in front of each of my JSON documents before Spark writes them into my S3 bucket:
df.createOrReplaceTempView("ParquetTable")
val parkSQL = spark.sql("select ...
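One way to get a leading newline per document (a sketch; the output path is illustrative) is to serialize each row with toJSON, prepend the newline, and write the result as plain text:

```scala
import spark.implicits._

// Each row becomes one JSON string; prefix a newline, then write as text.
val withLeadingNewline = df.toJSON.map(json => "\n" + json)
withLeadingNewline.write.text("s3://my-bucket/output/")
```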
-1
votes
1
answer
280
views
How can I add 2 (pyspark,scala) steps together in AWS EMR?
I want to add two steps together in an AWS EMR cluster.
Step 1 is PySpark-based code; step 2 is Scala-Spark-based code.
How do I achieve this?
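Both steps can go into a single add-steps call; each step definition simply points at its own artifact (a PySpark script or a Scala jar), and EMR runs them in order. A sketch where the cluster ID, step names, class name, and S3 paths are all placeholders:

```shell
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=Spark,Name=Step1-PySpark,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/steps/step1.py]' \
  'Type=Spark,Name=Step2-Scala,ActionOnFailure=CONTINUE,Args=[--class,com.example.Step2,s3://my-bucket/steps/step2.jar]'
```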
1
vote
1
answer
840
views
Stream data from flink to S3
I am using Flink on Amazon EMR and want to stream the results of my pipeline to an S3 bucket.
I am using Flink version 1.11.2.
This is a snippet of how the code looks right now:
val ...
1
vote
1
answer
480
views
Spark Kryo deserialization of EMR-produced files fails locally
Upon upgrading EMR version to 6.2.0 (we previously used 5.0 beta - ish) and Spark 3.0.1, we noticed that we were unable to locally read Kryo files written from EMR clusters (this was obviously ...
0
votes
2
answers
190
views
How to run transformation in parallel spark
I am trying to read a text.gz file, repartition it, and do some transformations. However, when I look at the DAG, stage 1 is reading the data and doing the transformation in only 1 task, and it is taking time ...
1
vote
1
answer
734
views
Set maximizeResourceAllocation=true at EMR cluster level programatically in spark Scala
I'm trying to find a way to set the maximizeResourceAllocation=true property at the EMR cluster level from Spark Scala. I used the --conf maximizeResourceAllocation=true argument with spark-submit ...
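As far as the AWS docs describe it, maximizeResourceAllocation is applied through the "spark" configuration classification when the cluster is created, not as a spark-submit --conf, so it is normally set at launch time. A sketch where the instance types, counts, and release label are placeholders:

```shell
aws emr create-cluster \
  --release-label emr-6.6.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --configurations '[{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"}}]' \
  --use-default-roles
```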
0
votes
0
answers
73
views
How to make sure spark job is Utilizing all the available resource (utilizing all the containers)
I have a Spark streaming job that reads data from Kafka and puts it into a data warehouse.
It runs perfectly fine, but a point of concern: when I monitored the logs, I noticed that logs from my ...
1
vote
0
answers
453
views
Spark with AWS Elastic Search integration - Getting - Cloud instance without the proper setting 'es.nodes.wan.only' error
I created an AWS Elasticsearch instance with public access and tried to push a dataframe into ES from my local machine; I am getting the exception below:
Exception in thread "main" org.elasticsearch.hadoop....
0
votes
0
answers
107
views
Where is output file created after running FileWriter in AWS EMR
This is how I am writing to a file (Scala code):
import java.io.FileWriter
val fw = new FileWriter("my_output_filename.txt", true)
fw.write("something to write into output file")
fw....
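A plain java.io.FileWriter writes to the local filesystem of whichever JVM runs it: the driver's current working directory in client mode, or a transient YARN container directory in cluster mode, never HDFS or S3. Using an absolute path makes the location predictable; the path below is illustrative:

```scala
import java.io.FileWriter

// Writes to the local disk of the JVM executing this code (the driver in
// client mode; a YARN container dir, cleaned up afterwards, in cluster mode).
val fw = new FileWriter("/tmp/my_output_filename.txt", true)
try fw.write("something to write into output file\n")
finally fw.close()
```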
1
vote
1
answer
477
views
Running AWS EMR Spark: java.io.FileNotFoundException ... io.netty_netty-transport-native-epoll-4.1.59.Final.jar does not exist
When trying to run a Spark 3.0.1 job on AWS emr-6.1.0, the following error occurs:
Exception in thread "main" java.io.FileNotFoundException: File file:/home/hadoop/.ivy2/jars/io.netty_netty-...
1
vote
0
answers
729
views
How to fix java.lang.NoSuchMethodError on AWS EMR/Spark?
I am running a Spark project on AWS/EMR which includes the following dependency:
"com.google.apis" % "google-api-services-analyticsreporting" % "v4-rev20210106-1.31.0"
&...
0
votes
1
answer
1k
views
How to use Spark-Submit to run a scala file present on EMR cluster's master node?
So, I connect to my EMR cluster's master node using SSH. This is the file structure present in the master node:
|-- AnalysisRunner.scala
|-- AutomatedConstraints.scala
|-- deequ-1.0.1.jar
|-- new
| |...
0
votes
2
answers
219
views
spark scala 'take(10)' operation is taking too long
I got the following code within my super-simple Spark Scala application:
...
val t3 = System.currentTimeMillis
println("VertexRDD created in " + (t3 - t2) + " ms")
...
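Worth remembering when timing this: Spark transformations are lazy, so the first action (take, count, and so on) pays for the entire upstream lineage, which can make take(10) look slow. A sketch of isolating the cost, assuming a GraphX-style vertices RDD as the excerpt suggests:

```scala
// Materialize (and cache) the expensive lineage once, then time the action.
val vertices = graph.vertices        // lazy: nothing computed yet
vertices.cache()
vertices.count()                     // forces the full computation once

val t0 = System.currentTimeMillis
vertices.take(10)                    // now cheap: served from the cache
println("take(10) in " + (System.currentTimeMillis - t0) + " ms")
```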