44,385 questions
0
votes
0
answers
10
views
Apache Spark: java.lang.NoClassDefFoundError for software.amazon.awssdk.transfer.s3.progress.TransferListener when reading CSV from S3
I am trying to read a CSV file from S3 using Apache Spark, but I encounter the following error:
java.lang.NoClassDefFoundError: software/amazon/awssdk/transfer/s3/progress/TransferListener
at java....
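A minimal sketch of the usual classpath fix, assuming the error comes from hadoop-aws 3.4.x being present without the matching AWS SDK v2 bundle jar; the coordinates and versions below are assumptions and would normally be passed as --packages to spark-submit rather than set in code.

```java
// Sketch only: versions are assumptions; match them to your own Spark/Hadoop build.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class S3CsvRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("s3-csv-read")
                // Equivalent of `spark-submit --packages ...`: hadoop-aws plus the AWS SDK v2
                // bundle, which carries software.amazon.awssdk.transfer.s3.progress.TransferListener.
                .config("spark.jars.packages",
                        "org.apache.hadoop:hadoop-aws:3.4.0,software.amazon.awssdk:bundle:2.24.6")
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("s3a://my-bucket/path/to/file.csv"); // placeholder bucket/path
        df.show();
    }
}
```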
0
votes
0
answers
14
views
Error: `callbackHandler` may not be null when connecting to HDFS using Kerberos in Jakarta EE
I am trying to connect to HDFS using Kerberos authentication in a JakartaEE application. The connection code appears to be set up correctly, but I am encountering the following error when attempting ...
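For comparison, a minimal keytab-based login sketch using Hadoop's UserGroupInformation; the principal, keytab path, and NameNode address are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");       // placeholder address
        conf.set("hadoop.security.authentication", "kerberos");

        // Log in once from a keytab; UGI handles the Kerberos login module internally.
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "user@EXAMPLE.COM",                    // hypothetical principal
                "/etc/security/keytabs/user.keytab");  // hypothetical keytab path

        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println(fs.exists(new Path("/tmp")));
        }
    }
}
```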
0
votes
1
answer
12
views
Flink-SQL dependencies: how to find them in the Maven repo
I'm a beginner with Apache platforms as well as Flink. I'm trying to run the Flink-SQL code below. I have 2 questions:
I need to find the connector "filesystem" (in maven repo or elsewhere)
Is ...
1
vote
1
answer
52
views
Can't configure Hive Metastore client jars in Spark 3.5.1
I need to configure my Spark 3.5.1 application so that it uses a specific version of the Hive Metastore client.
I read in the documentation that I can use:
spark.sql.hive.metastore.jars
spark.sql.hive.metastore....
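For reference, a sketch of how those properties fit together on a SparkSession; the metastore version and jar path used here are assumptions, not values from the question.

```java
import org.apache.spark.sql.SparkSession;

public class HiveMetastoreJars {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-metastore-jars")
                .enableHiveSupport()
                // Assumed metastore version; must match the jars you point at below.
                .config("spark.sql.hive.metastore.version", "2.3.9")
                // "path" tells Spark to load the client jars from an explicit location.
                .config("spark.sql.hive.metastore.jars", "path")
                .config("spark.sql.hive.metastore.jars.path",
                        "file:///opt/hive-2.3.9/lib/*.jar") // hypothetical path
                .getOrCreate();

        spark.sql("SHOW DATABASES").show();
    }
}
```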
0
votes
0
answers
26
views
Spark - Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemConfiguration
I am testing a Spark application locally; when it writes to a GCS bucket I get the error below.
Error:
java.lang.reflect.InvocationTargetException
java.lang.RuntimeException: java.lang.reflect....
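A hedged sketch of a common workaround: pinning the GCS connector and binding the gs:// scheme explicitly. The connector version is an assumption and credential settings are omitted.

```java
import org.apache.spark.sql.SparkSession;

public class GcsWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("gcs-write")
                .master("local[*]")
                // Assumed connector version; normally supplied via --packages or --jars.
                .config("spark.jars.packages",
                        "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.21")
                // Bind the gs:// scheme to the connector's FileSystem implementations.
                .config("spark.hadoop.fs.gs.impl",
                        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
                .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
                .getOrCreate();

        // Placeholder bucket; authentication configuration is intentionally left out.
        spark.range(100).write().mode("overwrite").parquet("gs://my-bucket/tmp/out");
    }
}
```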
1
vote
1
answer
35
views
Spark and HDFS cluster running on docker
I'm trying to set up a Spark application running on my local machine to connect to an HDFS cluster where the NameNode is running inside a Docker container.
Here are the relevant details of my setup:
...
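A minimal sketch, assuming the NameNode RPC port is published on localhost:8020; dfs.client.use.datanode.hostname is the usual knob when DataNodes advertise container-internal addresses.

```java
import org.apache.spark.sql.SparkSession;

public class SparkDockerHdfs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-docker-hdfs")
                .master("local[*]")
                // Assumed port mapping from the Docker container to the host.
                .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:8020")
                // Ask the HDFS client to use DataNode hostnames instead of internal IPs.
                .config("spark.hadoop.dfs.client.use.datanode.hostname", "true")
                .getOrCreate();

        spark.read().text("hdfs://localhost:8020/tmp/sample.txt").show(); // placeholder path
    }
}
```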
0
votes
0
answers
12
views
How to handle duplicate key outputs in the Mapper phase for HDFS PageRank implementation?
I was writing the PageRank code to run on HDFS, so I wrote the Mapper and Reducer. The data I have is in the following format: page 'outgoing_links,' such as:
Page_1 Page_18,Page_109,Page_696,...
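An illustrative mapper sketch for that line format (class and marker names are made up): duplicate keys emitted by the mapper are not a problem, because the shuffle groups all values for a page before the reducer sums them.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PageRankMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Expected line: "Page_1<whitespace>Page_18,Page_109,..."
        String[] parts = value.toString().trim().split("\\s+", 2);
        if (parts.length < 2) return;

        String page = parts[0];
        String[] links = parts[1].split(",");
        double contribution = 1.0 / links.length; // assumes an initial rank of 1.0 for the sketch

        for (String link : links) {
            // The same target page may be emitted many times; the reducer sums contributions.
            context.write(new Text(link.trim()), new Text(Double.toString(contribution)));
        }
        // Preserve the link structure for the next iteration.
        context.write(new Text(page), new Text("LINKS|" + parts[1]));
    }
}
```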
0
votes
0
answers
49
views
Hadoop : Exception in thread "main" java.lang.UnsupportedOperationException: 'posix:permissions' not supported as initial attribute
I am using this command for word count in Command Prompt in Windows
hadoop jar "C:\hadoop\share\hadoop\mapreduce\hadoop-mapreduce-examples-3.4.0.jar" wordcount /newdir/HadoopSmall.txt /...
-1
votes
1
answer
26
views
Is the Hadoop documentation wrong for set
The documentation of the Hadoop Job API gives as example:
From https://hadoop.apache.org/docs/r3.3.5/api/org/apache/hadoop/mapreduce/Job.html
Here is an example on how to submit a job:
// ...
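For reference, a compilable variant of that snippet: the Job class itself has no setInputPath/setOutputPath methods, so the paths go through FileInputFormat/FileOutputFormat; the identity Mapper and Reducer stand in for real classes here.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJob {
    public static void main(String[] args) throws Exception {
        // Create a new Job
        Job job = Job.getInstance();
        job.setJarByClass(MyJob.class);
        job.setJobName("myjob");

        // Input/output paths are set via the input/output format helpers, not on Job itself.
        FileInputFormat.addInputPath(job, new Path("in"));
        FileOutputFormat.setOutputPath(job, new Path("out"));

        // Identity mapper/reducer as placeholders for real implementations.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Submit the job, then poll for progress until it completes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```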
0
votes
0
answers
19
views
Why does the number of slots a Flink application requests differ between startup and runtime?
I found that when a Flink application starts, the number of slots requested is SUM(maximum parallelism of each task), but when the application is running, the number of slots requested is JobManager(1) + ...
1
vote
1
answer
70
views
how to set "api-version" dynamically in fs.azure.account.oauth2.msi.endpoint
Currently I'm using hadoop-azure-3.4.1 via pyspark library to connect to ABFS. According to the documentation - https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html#Azure_Managed_Identity - ...
0
votes
0
answers
16
views
Hadoop INIT_FAILD: Data is not Replicated on DataNode
I am using Hadoop 3.2.4 as a standalone service on Windows 11 23H2. I am trying to ingest data from Apache NiFi into a two-node HDFS cluster.
NameNode (Also behaving as Datanode1) - ...
0
votes
0
answers
10
views
How to solve Unrecognized VM option 'UseConcMarkSweepGC' when trying to install Apache HBase [duplicate]
When I try to run ./start-hbase.cmd I get the following error:
PS C:\hbasesetup\hbase-2.6.1\bin> ./start-hbase.cmd
Unrecognized VM option 'UseConcMarkSweepGC'
Error: Could not create the Java ...
-1
votes
0
answers
41
views
java.lang.NoClassDefFoundError: org.apache.hadoop.security.SecurityUtil (initialization failure)
I have a Spark Structured Streaming application with Hadoop dependencies included. To support Java 17 I have added the JVM args below in build.gradle:
test {
jvmArgs += [
'--add-exports=...
0
votes
1
answer
18
views
How to build the Hadoop source code on macOS (M2 chip)
I am trying to build the Hadoop source code on macOS (M2 chip); the system version is macOS Ventura.
The problem:
ld: warning: ignoring file '/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home/jre/lib/...
1
vote
1
answer
98
views
"MERGE INTO TABLE is not supported temporarily" while trying to merge data into an Iceberg table in a Hadoop/Hive environment
I need to perform a merge operation into my Iceberg table. I am using a Jupyter notebook on an AWS EMR setup.
Spark: '3.4.1-amzn-2' Hadoop: 3.3.6 Hive: 3.1.3 EMR Version: 6.15.0 Scala: 'version 2.12....
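MERGE INTO generally requires the Iceberg SQL extensions and an Iceberg-aware catalog on the Spark side; a sketch of that setup follows, where the catalog name, warehouse path, and table names are placeholders and the Iceberg runtime jar matching your Spark/Scala version is assumed to be on the classpath.

```java
import org.apache.spark.sql.SparkSession;

public class IcebergMerge {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-merge")
                // Enable Iceberg's SQL extensions (needed for MERGE INTO).
                .config("spark.sql.extensions",
                        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
                // Register a Hadoop-backed Iceberg catalog under a placeholder name.
                .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.my_catalog.type", "hadoop")
                .config("spark.sql.catalog.my_catalog.warehouse", "hdfs:///user/iceberg/warehouse")
                .getOrCreate();

        // Placeholder tables; both must live in the Iceberg catalog for MERGE to be supported.
        spark.sql("MERGE INTO my_catalog.db.target t "
                + "USING my_catalog.db.updates u ON t.id = u.id "
                + "WHEN MATCHED THEN UPDATE SET * "
                + "WHEN NOT MATCHED THEN INSERT *");
    }
}
```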
0
votes
1
answer
42
views
Java action in Apache Oozie workflow
I am trying to configure an Apache Oozie workflow to execute different actions depending on the day of the week. After reading https://stackoverflow.com/questions/71422257/oozie-coordinator-get-day-of-...
1
vote
0
answers
19
views
Dataproc Hive Job - OutOfMemoryError: Java heap space
I have a Dataproc cluster; we are running an INSERT OVERWRITE query through the Hive CLI, which fails with OutOfMemoryError: Java heap space.
We adjusted memory configurations for reducers and Tez tasks, ...
0
votes
0
answers
38
views
java.net.ConnectException: Connection refused when trying to connect to HDFS from Spark on EMR to create an ICEBERG table
I am new to spark and I'm working with a Spark job on an AWS EMR cluster using jupyter notebook. I'm trying to interact with HDFS. Specifically, I am trying to create an Apache Iceberg table. However, ...
0
votes
1
answer
91
views
Missing PutHDFS Processor in Apache NiFi 2.0.0
I'm using Apache NiFi 2.0.0, which unfortunately does not include the PutHDFS processor. My project requires this version of NiFi due to its integration capabilities with Python scripting, so ...
0
votes
0
answers
42
views
Apache Nifi: Puthdfs Processor -replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded
I am using Apache NiFi version 1.28. I am trying to create a minimal data flow where I generate data and ingest it into HDFS on HDP (Hortonworks Data Platform) 2.5.0, and I am getting the ...
0
votes
1
answer
56
views
Apache Nifi: PutHDFS Processor issue - PutHDFS Failed to write to HDFS java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configurable
I am using Apache NiFi version 1.28. I am trying to create a minimal data flow where I generate data and ingest it into HDFS on HDP (Hortonworks Data Platform) 2.5.0, and I am getting the ...
0
votes
1
answer
32
views
Interact with HBase running in Docker through Java code
I am quite new to HBase and Java. I managed to run an HBase image in Docker, and I can interact with it smoothly using the HBase shell. I also have access to the UI for monitoring HBase. However, when I ...
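A minimal Java connection sketch, assuming the container publishes ZooKeeper on localhost:2181 and that the HBase master/regionserver hostnames are resolvable from the host.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseDockerClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");        // assumed port mapping
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // List tables as a simple connectivity check.
            for (TableName table : admin.listTableNames()) {
                System.out.println(table.getNameAsString());
            }
        }
    }
}
```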
0
votes
0
answers
49
views
PutHDFS Processor Failed to write in HDP2.5.0 HDFS
I am using Apache NiFi 1.28, configured on a Windows system. My HDP 2.5.0 is running on a VM, and I want to ingest data from my local file system into the HDFS service running in ...
0
votes
0
answers
18
views
Update statement in hadoop ecosystem using hive (WARNING: An illegal reflective access operation has occurred)
I am having trouble with running a simple update statement.
The query I am executing is the following:
INSERT INTO data_source_metadata VALUES (
'source1',
'User ...
0
votes
0
answers
29
views
How to properly (or at least somehow) use org.apache.hadoop.fs.MultipartUploader?
I have Hadoop in my environment and use it as S3.
My current task is to implement logic for uploading a large file (say, >1 GB) to Hadoop with no buffering, so the data should be streamed into it.
...
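Not the MultipartUploader API itself, but a sketch of the plain streamed alternative: FileSystem.create returns an output stream, so copying an InputStream in fixed-size chunks never buffers the whole file in memory (paths are placeholders).

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamingUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf);
             InputStream in = Files.newInputStream(Paths.get("/data/big-file.bin")); // hypothetical source
             FSDataOutputStream out = fs.create(new Path("/uploads/big-file.bin"), true)) {
            // Copy in 4 MB chunks; only the buffer is ever held in memory.
            IOUtils.copyBytes(in, out, 4 * 1024 * 1024);
        }
    }
}
```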
0
votes
0
answers
52
views
Failing to repartition data with PySpark on HDFS
I have over 7k parquet files in an HDFS dataset, totaling a little under 74 GB. One problem is that the file sizes are quite variable, ranging from 12 KB to 622 MB. So, I'd like to repartition the ...
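The question is about PySpark, but the DataFrame call has the same shape in Java; the partition count below is a rough assumption aiming at roughly 128 MB per output file, and the paths are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RepartitionParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("repartition-parquet").getOrCreate();

        Dataset<Row> df = spark.read().parquet("hdfs:///data/source_dataset"); // placeholder path

        // ~74 GB of data / ~128 MB per file is roughly 600 evenly sized partitions.
        df.repartition(600)
          .write()
          .mode("overwrite")
          .parquet("hdfs:///data/source_dataset_repartitioned"); // placeholder path
    }
}
```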
0
votes
2
answers
186
views
java.lang.UnsupportedOperationException: 'posix:permissions'
I am trying to do a wordcount operation using Hadoop. Hadoop is configured and I can see the DataNode, NameNode, ResourceManager, and NodeManager running. I am using Hadoop version 3.4.0 and Java version 8. ...
0
votes
0
answers
37
views
How to connect an Airflow container with an HDFS container
Currently I'm using Docker for both HDFS and Airflow. They share a network, but I can't use HDFS commands inside the Airflow container. I've tried to install apache-airflow-providers-apache-hdfs but I ...
0
votes
0
answers
22
views
how does the CombineHiveInputFormat split the files(ORC/TextFile)
set maxsize of splits:
set mapreduce.input.fileinputformat.split.maxsize=209715200; --200MB(256MB default)
set mapreduce.input.fileinputformat.split.minsize=0;
run the hive sql on a table STORED AS ...
0
votes
1
answer
92
views
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z unresolved
I have looked through answers for similar issues and none have resolved the issue I'm having. Some Hadoop commands seem to work (for example, hadoop fs -cat) while others do not (hadoop fs -ls, which threw ...
1
vote
1
answer
74
views
Reading parquet takes too long in parquet-java
I am using parquet-hadoop to read a Snappy-compressed parquet file. However, I discovered that the reading time is quadratic to the file size, and it is unacceptably long.
The following is the code I ...
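For comparison, a baseline read loop (assuming parquet-avro is available) that opens the file once and pulls records from a single ParquetReader, which should scale linearly with file size; the path is a placeholder.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ParquetScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        HadoopInputFile input =
                HadoopInputFile.fromPath(new Path("/data/file.snappy.parquet"), conf);

        long count = 0;
        // One reader instance for the whole file; the footer is read once at open time.
        try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(input).build()) {
            for (GenericRecord record = reader.read(); record != null; record = reader.read()) {
                count++; // process the record here
            }
        }
        System.out.println("rows: " + count);
    }
}
```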
0
votes
0
answers
74
views
Pyspark, Hadoop, and S3: java.lang.NoSuchMethodError: org.apache.hadoop.fs.s3a.Listing$FileStatusListingIterator
I've been facing compatibility issues related to getting delta-spark to work straight out of the box with S3 and wanted to get some advice. I've tried dozens of combinations of versions between Spark, ...
0
votes
1
answer
42
views
Failed to transform org.apache.hadoop.fs.FileSystem$Cache. org.apache.hadoop.fs.FileSystem$Cache$Key class is frozen
I am trying to mock the Hadoop FileSystem in my Scala test. Any idea how to get around this, please?
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs....
0
votes
0
answers
25
views
Unable to Move Deleted Files to Trash via Hadoop Web Interface
I have encountered an issue with the Hadoop-3.3.6 Web interface regarding file deletion. By default, when I delete files through the Hadoop Web interface, they are permanently removed and do not go to ...
-3
votes
1
answer
34
views
.gz files are unsplittable. But if I place them in HDFS, they create multiple blocks depending on block size
We all know .gz is non-splittable, which means only a single core can read it. This means that when I place a huge .gz file on HDFS, it should actually be present as a single block. I see it is getting split ...
0
votes
1
answer
39
views
hdfs dfs -mkdir no such file or directory
I'm new to hadoop and I'm trying to create a directory in hdfs called input_dir. I have set up my vm, installed and started hadoop successfully.
This is the command that I run:
hdfs dfs -mkdir ...
1
vote
1
answer
72
views
Unable to connect to GCS Bucket using hadoop-connector over WIF
Getting the below error while connecting to GCS with Hadoop:
Error reading credentials from stream, 'type' value 'external_account' not recognized. Expecting 'authorized_user' or 'service_account'
I am ...
4
votes
2
answers
6k
views
Hadoop Installation, Error: getSubject is supported only if a security manager is allowed
I tried to install Hadoop on macOS Ventura but it has failed several times. I tried downloading lower versions of Hadoop as well, but no luck so far.
Tried Hadoop Versions: 3.4.0 and 3.3.6
Java ...
0
votes
1
answer
80
views
Hive Always Fails at Mapreduce
I just installed Hadoop 3.3.6 and Hive 4.0.0 with MySQL as the metastore. When running CREATE TABLE or SELECT * FROM ... it runs well. But when I try to do an INSERT or a SELECT with a JOIN, Hive always fails. I'm ...
0
votes
0
answers
36
views
I set up port forwarding on port 50070 to access the Hadoop master node
$ ssh -i C:/Users/amyousufi/Desktop/private-key -L 50070:localhost:50070 [email protected]
After setting up my port forwarding, I receive the following error:
bind [127.0.0.1]:50070: Permission denied
...
0
votes
0
answers
28
views
acl for yarn capacity scheduler is not working
I have a Hadoop cluster (1 master, 1 slave) and I divided the resources into 2 queues: a and b. I then used ACLs to grant permissions so that user1 can submit to queue a and user2 can use queue b. I tried to run [user2@...
1
vote
0
answers
25
views
Tracking URL disappears when Spark 3.5.1 runs on Hadoop 3
For Spark 2.4, 3.3.0, 3.4.3, or any version of Spark below 3.5, the tracking URL shows up fine on the same community version of Apache Hadoop YARN 3.2.
But Spark 3.5.1 just shows the app ID in ...
0
votes
1
answer
43
views
HQL query using the YARN application ID
So I want to know if I can get the HQL query or the SQL query using the applicationId of a hive query that is running on YARN.
I tried using
yarn logs applicationid
But it's showing the entire ...
1
vote
1
answer
289
views
How to set up a connection between Spark code and Spark container using Docker?
I am working with a Docker setup for Hadoop and Spark using the following repository: docker-hadoop-spark. My Docker Compose YAML configuration is working correctly, and I am able to run the ...
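A minimal sketch, assuming the compose file publishes the Spark master on localhost:7077; spark.driver.host is the usual extra setting when the driver runs outside the containers' network, and the host alias shown is a Docker Desktop assumption.

```java
import org.apache.spark.sql.SparkSession;

public class SparkDockerMaster {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("docker-spark-test")
                .master("spark://localhost:7077")                    // assumed published port
                // Assumption: lets executors inside the containers reach the driver on the host.
                .config("spark.driver.host", "host.docker.internal")
                .getOrCreate();

        // Trivial job to confirm executors can talk back to the driver.
        System.out.println(spark.range(10).count());
    }
}
```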
0
votes
0
answers
44
views
Custom NEWLINE symbol in Greenplum
I want to create an external table in Greenplum like this:
CREATE READABLE EXTERNAL TABLE "table"
("id" INTEGER, "name" TEXT)
LOCATION ('<file location>')
FORMAT 'TEXT' (...
0
votes
0
answers
26
views
Issues with ImportTsv - Job Fails with Exit Code 1
I am currently facing an issue while using the ImportTsv command to load data into an HBase table. I am running the command and the job is failing with a non-zero exit code 1.
Hadoop version : 3.4.0
...
0
votes
0
answers
92
views
Py4JJavaError: An error occurred while calling o117.showString
I am encountering a java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$ error when trying to run a PySpark query to show tables from Hive. My environment ...
0
votes
0
answers
8
views
how many calls per handler are allowed in the queue
Maybe someone can tell me how to calculate the size of ipc.server.handler.queue.size in core-site.xml.
From the description of the property:
Specifies how many calls for each handler are allowed in the ...
0
votes
0
answers
13
views
Solr - Hadoop theory of data writing
I'm wondering how the data writing actually works.
In a Solr cluster that saves its indexes onto Hadoop DataNodes, if I trigger a shard split, how is the data managed?
Let's say that I split a ...