8,296 questions
0
votes
0
answers
14
views
Error: `callbackHandler` may not be null when connecting to HDFS using Kerberos in Jakarta EE
I am trying to connect to HDFS using Kerberos authentication in a Jakarta EE application. The connection code appears to be set up correctly, but I am encountering the following error when attempting ...
1
vote
1
answer
35
views
Spark and HDFS cluster running on docker
I'm trying to set up a Spark application running on my local machine to connect to an HDFS cluster where the NameNode is running inside a Docker container.
Here are the relevant details of my setup:
...
1
vote
0
answers
20
views
How to Prevent Batch Jobs from Accessing Streaming Job Outputs While Writing to HDFS?
I'm developing a clickstream project that collects user events and stores them within HDFS.
You can see the project architecture on the diagram:
1. The Collector collects events via HTTP API and ...
0
votes
0
answers
25
views
How to properly output a path containing non-English characters with Popen
This is my function code:
~~~~sql
DO
$$
from subprocess import PIPE,Popen
import os
import sys
env = os.environ.copy()
reload(sys)
sys.setdefaultencoding('utf-8')
plpy.notice(sys....
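A minimal sketch of one way to approach this, assuming Python 2 (as in the plpython snippet above): force a UTF-8 locale in the child's environment and pass the path to Popen as UTF-8 bytes. The directory name and the `ls` command below are hypothetical examples, not the asker's code.
~~~~python
# -*- coding: utf-8 -*-
# Hedged sketch (Python 2): run an external command on a path containing
# non-English characters by encoding the path to UTF-8 bytes and exporting a
# UTF-8 locale to the child process. The path is a made-up example.
import os
from subprocess import PIPE, Popen

env = os.environ.copy()
env['LANG'] = 'en_US.UTF-8'
env['LC_ALL'] = 'en_US.UTF-8'

path = u'/data/\u4e2d\u6587\u76ee\u5f55'            # hypothetical non-English path
proc = Popen(['ls', '-l', path.encode('utf-8')],
             stdout=PIPE, stderr=PIPE, env=env)
out, err = proc.communicate()
print(out.decode('utf-8'))
~~~~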
0
votes
0
answers
36
views
Data Retrieval from HDFS
I am trying to retrieve data from a table in HDFS in which a column contains timestamps.
I am connected to HDFS using CDSW and running a Python script which opens a Spark session and runs a SQL query ...
0
votes
0
answers
16
views
Hadoop INIT_FAILD: Data is not Replicated on DataNode
I am using Hadoop 3.2.4 as a standalone service on Windows 11 23H2. I am trying to ingest data from Apache NiFi into a two-node Hadoop HDFS cluster.
NameNode (Also behaving as Datanode1) - ...
1
vote
1
answer
48
views
How to set up Apache Impala on Windows using Docker? [closed]
Can anyone help me with a step-by-step guide or a docker-compose.yml file that can be used to set up Apache Impala with its required services (e.g., Impala Daemon, State Store, Catalog Service, HDFS) ...
0
votes
0
answers
38
views
java.net.ConnectException: Connection refused when trying to connect to HDFS from Spark on EMR to create an ICEBERG table
I am new to Spark and I'm working with a Spark job on an AWS EMR cluster using a Jupyter notebook. I'm trying to interact with HDFS. Specifically, I am trying to create an Apache Iceberg table. However, ...
0
votes
1
answer
92
views
Missing PutHDFS Processor in Apache NiFi 2.0.0
I'm using Apache NiFi 2.0.0, which unfortunately does not include the PutHDFS processor. My project requires this version of NiFi due to its integration capabilities with Python scripting, so ...
0
votes
0
answers
42
views
Apache NiFi: PutHDFS Processor - replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded
I am using Apache NiFi version 1.28. I am trying to create a minimal data flow where I generate data and want to ingest it into HDFS on HDP (Hortonworks Data Platform) 2.5.0. I am getting the ...
0
votes
1
answer
56
views
Apache Nifi: PutHDFS Processor issue - PutHDFS Failed to write to HDFS java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configurable
I am using Apache NiFi version 1.28. I am trying to create a minimal data flow where I generate data and want to ingest it into HDFS on HDP (Hortonworks Data Platform) 2.5.0. I am getting the ...
0
votes
0
answers
33
views
Incompatible configuration used between Spark and HBaseTestingUtility
We are using the MiniDFSCluster and MiniHbaseCluster from HBaseTestingUtility to run unit tests for our Spark jobs.
The Spark configuration that we use is :
conf.set("spark.sql....
1
vote
3
answers
170
views
Why is metadata consuming large amount of storage and how to optimize it?
I'm using PySpark with Apache Iceberg on an HDFS-based data lake, and I'm encountering significant storage issues. My application ingests real-time data every second. After approximately 2 hours, I ...
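Not an answer from the thread, just a hedged sketch of the usual mitigation: Iceberg writes new metadata and manifest files on every commit, so per-second ingestion accumulates metadata quickly, and expiring old snapshots lets that space be reclaimed. The catalog and table names below ("my_catalog", "db.events") and the retention window are assumptions, and `spark` is assumed to be a SparkSession configured with the Iceberg extensions.
~~~~python
# Hedged sketch: expire old Iceberg snapshots so their metadata and data files
# can be cleaned up. Catalog/table names and the retention window are made up.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 10
    )
""")
~~~~
Compacting small data files with Iceberg's `rewrite_data_files` procedure and batching commits less frequently are the other common levers for this kind of metadata growth.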
0
votes
0
answers
25
views
Unable to Move Deleted Files to Trash via Hadoop Web Interface
I have encountered an issue with the Hadoop-3.3.6 Web interface regarding file deletion. By default, when I delete files through the Hadoop Web interface, they are permanently removed and do not go to ...
-3
votes
1
answer
34
views
.gz files are unsplittable. But if I place them in HDFS, they create multiple blocks depending on block size
We all know .gz is non-splittable, which means only a single core can read it. So when I place a huge .gz file on HDFS, I would expect it to be present as a single block, yet I see it is getting split ...
0
votes
1
answer
39
views
hdfs dfs -mkdir no such file or directory
I'm new to Hadoop and I'm trying to create a directory in HDFS called input_dir. I have set up my VM and installed and started Hadoop successfully.
This is the command that I run:
hdfs dfs -mkdir ...
0
votes
0
answers
36
views
I set up port forwarding on port 50070 to access the Hadoop master node
$ ssh -i C:/Users/amyousufi/Desktop/private-key -L 50070:localhost:50070 [email protected]
After setting up the port forwarding, I receive the following error:
bind [127.0.0.1]:50070: Permission denied
...
0
votes
0
answers
29
views
Cloudera HDFS connection is failing at Kerberos login
I am using this code to connect to Cloudera HDFS:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs....
0
votes
0
answers
44
views
Custom NEWLINE symbol in Greenplum
I want to create an external table in Greenplum like this:
CREATE READABLE EXTERNAL TABLE "table"
("id" INTEGER, "name" TEXT)
LOCATION ('<file location>')
FORMAT 'TEXT' (...
0
votes
1
answer
73
views
Integrate Hadoop HDFS with Snowflake
I'm building a personal project, but I got stuck.
Specifically, after writing Spark jobs to process and transform data, I load the data into Hadoop HDFS. Then I want to connect HDFS to Snowflake, then I ...
0
votes
0
answers
50
views
One Stage/Job in Spark application only running on one executor
I have been running some tests regarding the execution of Spark and I have noticed a weird behaviour in one of my tests. My Spark cluster consists of 1 master node and 4 worker nodes, and the ...
0
votes
0
answers
13
views
Solr - Hadoop theory of data writing
I'm wondering how data writing actually works.
In a Solr cluster that saves its indexes onto Hadoop datanodes, if I trigger a shard split, how is the data managed?
Let's say that I split a ...
0
votes
0
answers
36
views
Data Ingestion through DataNodes on HDFS
I am ingesting or writing data on HDFS. I have 5 data nodes, and I wrote a PySpark script to ingest data from each data node. I am doing this because I want a replication factor of 1. However, when I ...
1
vote
0
answers
16
views
Combining small files into bigger ones didn't help improve storage efficiency on HDFS [closed]
On my HDFS, the block size is 512 MB.
I have an offline flow which generates data daily and outputs the data to HDFS.
I saw many small files (several KB) generated, so I updated the flow, making it ...
0
votes
1
answer
74
views
HDFS datanode: Couldn't resolve the hostname to an IP address
I'm configuring the Virtual Machine with Hadoop ecosystem using Docker, VirtualBox, with Ubuntu 24.04. For now, I'm using a docker-compose.yaml to run multiple services, including namenode, datanode, ...
0
votes
1
answer
26
views
Configure Spark to read from HDFS by default
I've installed HDFS and Spark. However, how can I configure Spark to read from hdfs://localhost:9000/ by default? Currently, to load a file into a Spark DataFrame, I need to write spark.read.load("...
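A minimal sketch of one way to do this, assuming the namenode really listens on localhost:9000 as in the question: pass `fs.defaultFS` to Spark's Hadoop configuration so scheme-less paths resolve against HDFS. The same property can instead go into a core-site.xml on Spark's classpath; the example path below is hypothetical.
~~~~python
# Hedged sketch: make scheme-less paths resolve on HDFS instead of the local
# filesystem. Host/port and the example path are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-default-fs")
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000")
    .getOrCreate()
)

# Now "/data/example.parquet" is read from hdfs://localhost:9000/data/example.parquet.
df = spark.read.parquet("/data/example.parquet")
~~~~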
1
vote
1
answer
31
views
Hadoop HA with QJM: wrong installation
This is my first installation of Hadoop HA with QJM and I've been having a lot of trouble for at least a whole week.
The lab is set up as follows:
10.0.0.10 zoo1 solr1 had1
10.0.0.11 zoo2 solr2 had2
10.0.0....
0
votes
0
answers
66
views
Greenplum PXF server: create external table based on partitioned Parquet HDFS file with partition columns in SELECT
I have a partitioned Parquet file on HDFS and I want to create an external table in Greenplum with the partition columns in it.
This is the HDFS file:
CREATE TABLE productshelf.funnel (
system_product_id ...
0
votes
0
answers
38
views
Sqoop: error when migrating a table from a pgAdmin4 database to HDFS
I'm a newbie to Sqoop and HDFS and I'm trying to migrate a table from pgAdmin4 to HDFS using Sqoop. I spent a day fixing errors, but finally I encountered this error that I can't find a solution ...
0
votes
0
answers
27
views
Issues putting/copying a local file to HDFS in Docker
To put this in context, I am working on a solution that requires collecting data using Kafka and then storing that data in HDFS. The Kafka producer part is already working, but when it comes to the ...
1
vote
0
answers
52
views
How to setup HDFS with Geoserver?
I want to test the performance of HDFS with geospatial data, so I set up HDFS, but I am unable to fetch the data from HDFS and give it to GeoServer. Is there any way to fetch the data from HDFS and give it ...
0
votes
0
answers
29
views
How does the hdfs journal nodes work internally?
I understand that Journal Nodes are like the central repository for all edit logs (no matter which namenode is currently active, all push to the journal nodes). I suppose the QuorumJournalManager handles this ...
1
vote
0
answers
38
views
How to extract date part from folder name and move it to another folder on hdfs using pyspark
I currently have folders and sub-folders in a day-wise structure under the path '/dev/data/':
2024.03.30
    part-00001.avro
    part-00002.avro
2024.03.31
    part-00001.avro
    part-00002.avro
2024.04.01
...
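A hedged sketch of one approach, using Hadoop's FileSystem API through PySpark's JVM gateway rather than a distributed Spark job; the destination root and the renamed layout below are assumptions, not part of the question.
~~~~python
# Hedged sketch: parse the date out of each "YYYY.MM.DD" folder under /dev/data
# and rename (move) the folder under a hypothetical destination root.
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("move-dated-folders").getOrCreate()

jvm = spark.sparkContext._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
Path = jvm.org.apache.hadoop.fs.Path

src_root = "/dev/data"
dst_root = "/dev/archive"                      # hypothetical destination

for status in fs.listStatus(Path(src_root)):
    name = status.getPath().getName()          # e.g. "2024.03.30"
    m = re.match(r"(\d{4})\.(\d{2})\.(\d{2})$", name)
    if m:
        year, month, day = m.groups()
        dst = Path("{}/{}-{}-{}".format(dst_root, year, month, day))
        fs.mkdirs(dst.getParent())
        fs.rename(status.getPath(), dst)       # moves the whole day folder
~~~~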
0
votes
1
answer
59
views
Hive on Tez insert data fail at org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 95
hive> insert into music values (1,'Faded');
Query ID = hadoopusr_20240625144523_fccb44c9-00e4-4a79-9d81-b5d56517a67d
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
...
0
votes
1
answer
53
views
HiBench running error hibench.hadoop.examples.jar not found
I'm trying to use Intel's HiBench to build workloads on GCP. I successfully managed to build the Maven project and I set the configurations as follows:
hadoop.conf:
# Hadoop home
hibench.hadoop....
0
votes
0
answers
56
views
Configuration HDFS, SPARK, ZEPPELIN
Project-structure:
ROOT:
- hadoop
  - conf
    - core-site.xml
    - hdfs-site.xml
- local
  - data
    - .csv files
- zeppelin
  - conf
    - files .json
  - notebook
- docker-compose
core-site.xml
&...
0
votes
1
answer
21
views
How to shift operation to the Secondary NameNode?
I have a three-node HDFS cluster with a Name Node/Datanode, a Secondary Name Node/Datanode, and a Datanode.
My primary Name Node burned in a fire, but the other two are just fine. How do I shift to ...
0
votes
0
answers
30
views
CalledProcessError thrown when executing HDFS cp command through subprocess.check_output
I encountered a CalledProcessError when I ran the HDFS cp command using the subprocess.check_output function. Below is a sample of my program.
>>> import subprocess
>>> command = "hdfs ...
0
votes
0
answers
37
views
Error Starting Hadoop Services - JAVA_HOME is not set and could not be found
Brief Description of the Issue:
When attempting to start the Hadoop services within a Docker container, the Java Home (JAVA_HOME) is not being correctly found, resulting in errors when running the ...
0
votes
0
answers
21
views
How to migrate a large HDFS file (>12GB) from a source to a destination server without changing "rpc-max-received-response-size"?
I am working on migrating HDFS files from the old server to the new one.
I tried to use "hadoop dist" and "hdfs archive" to transfer a large HDFS file from the source to ...
0
votes
0
answers
98
views
docker - hdfs to hive issue
I have configured three containers that are networked because I intend to utilise Hadoop with Hive. You can access the Docker setup via https://github.com/jcool12/hadoophive/tree/main/hadoophive to ...
0
votes
1
answer
34
views
How to check which HDFS datanode IP is returned by the namenode to Spark?
If I'm reading/writing a dataframe in PySpark, specifying the HDFS namenode hostname and port:
df.write.parquet("hdfs://namenode:8020/test/go", mode="overwrite")
Is there any way to ...
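One hedged way to see the candidate datanodes for a path (not necessarily the exact replica Spark picked for a particular read) is `hdfs fsck` with block locations; the path below matches the question's example, and wrapping it in subprocess is just for illustration.
~~~~python
# Hedged sketch: list, per block of the file, the datanode addresses holding replicas.
import subprocess

report = subprocess.check_output(
    ["hdfs", "fsck", "/test/go", "-files", "-blocks", "-locations"]
)
print(report.decode("utf-8"))
~~~~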
0
votes
1
answer
25
views
Impact of HDFS replication factor on namenode memory
Does increasing the replication factor increase namenode memory usage in HDFS?
This link states the replication factor has no impact on namenode memory usage, but another link states otherwise.
0
votes
1
answer
64
views
Decommissioning data nodes in HDFS
I have some data nodes in Apache HDFS with replication factor 1 and want to decommission some of them without losing the data stored on them.
Because of the volume of data, I can't download the data ...
0
votes
0
answers
31
views
RPC response has invalid length of -16777216
I am trying to set up a single-node Hadoop cluster.
But it seems that I can't run any command.
No matter what I type, it always replies back like this:
user@MSI:~/Downloads/hadoop-3.4.0$ bin/hdfs dfs -...
0
votes
1
answer
129
views
PutHDFS NiFi issue
Good morning, I want to create a NiFi flow from a certain URL to my HDFS. I created my HDFS cluster locally with my own build and Dockerfile and it is working, but when I try to use the ...
0
votes
1
answer
53
views
Overwrite a hive table without downtime
I have a hive table which is associated with an HDFS path. The table is overwritten by a periodic job and has a few downstream consumers. The table gets dropped while being overwritten and if a ...
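A hedged sketch of one common pattern (not necessarily what the answer suggested): write the new data to a fresh HDFS directory, then repoint the table with `ALTER TABLE ... SET LOCATION` in a single metadata operation, so readers never see a dropped or half-written table. The table name, paths, and the assumption of an existing SparkSession `spark` and DataFrame `df` are all hypothetical.
~~~~python
# Hedged sketch: stage the new snapshot in a new directory, then swap the table's
# location in one metadata operation. Names and paths are made up.
staging_path = "hdfs://namenode:8020/warehouse/my_table/snapshot_20240601"

(df.write
   .mode("overwrite")
   .parquet(staging_path))

spark.sql("ALTER TABLE my_db.my_table SET LOCATION '{}'".format(staging_path))
~~~~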
0
votes
1
answer
78
views
how to read a db.properties file or any other conf file in the project when deployed
how to read a db.properties file or any other conf file in the project when deployed in production?
I'm getting this error...
24/05/09 16:34:32 INFO Client:
client token: N/A
...
0
votes
0
answers
14
views
Why are HBase StoreFileSize (from JMX) and hdfs dfs -du -h different?
I want to get the HBase table StoreFileSize,
but the result of the command curl hostip:port/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Tables is different from hdfs dfs -du -h /hbase/data/default/...
0
votes
1
answer
103
views
Spark read parquet files based on multiple partitions i.e., on DATE_KEY and BASE_FEED
I'm using PySpark to read Parquet files from an HDFS location partitioned by DATE_KEY. The following code always reads the file from the MAX(DATE_KEY) partition and converts it to a Polars dataframe.
def ...
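A hedged sketch of reading on both partition keys, assuming a layout like `.../DATE_KEY=<date>/BASE_FEED=<feed>/` under a hypothetical root path: `basePath` keeps both partition columns in the schema, and the filters prune the scan to just the matching partitions.
~~~~python
# Hedged sketch: filter on both partition columns instead of always taking MAX(DATE_KEY).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("multi-partition-read").getOrCreate()

root = "hdfs://namenode:8020/data/events"      # hypothetical root path

df = (
    spark.read
    .option("basePath", root)
    .parquet(root)
    .filter((col("DATE_KEY") == "2024-06-01") & (col("BASE_FEED") == "clickstream"))
)
~~~~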