0 votes
0 answers
14 views

Error: `callbackHandler` may not be null when connecting to HDFS using Kerberos in Jakarta EE

I am trying to connect to HDFS using Kerberos authentication in a JakartaEE application. The connection code appears to be set up correctly, but I am encountering the following error when attempting ...
ilham hitam's user avatar
1 vote
1 answer
35 views

Spark and HDFS cluster running on docker

I'm trying to set up a Spark application running on my local machine to connect to an HDFS cluster where the NameNode is running inside a Docker container. Here are the relevant details of my setup: ...
Paul Hartmann's user avatar
1 vote
0 answers
20 views

How to Prevent Batch Jobs from Accessing Streaming Job Outputs While Writing to HDFS?

I'm developing a clickstream project that collects user events and stores them within HDFS. You can see the project architecture on the diagram: 1. The Collector collects events via HTTP API and ...
csgn's user avatar
  • 51
0 votes
0 answers
25 views

How to normally output a path containing non-English characters in Popen

it's my function code ~~~~sql DO $$ from subprocess import PIPE,Popen import os import sys env = os.environ.copy() reload(sys) sys.setdefaultencoding('utf-8') plpy.notice(sys....
mateo's user avatar
  • 89
0 votes
0 answers
36 views

Data Retrieval from HDFS

I am trying to retrieve data from a table in HDFS in which a column contains timestamps. I am connected to HDFS using CDSW and running a Python script which opens a Spark session and runs an SQL query ...
Nikitas's user avatar
0 votes
0 answers
16 views

Hadoop INIT_FAILD: Data is not Replicated on DataNode

I am using Hadoop 3.2.4 as a standalone service on Windows 11 23H2, and I am trying to ingest data from Apache NiFi into a two-node Hadoop HDFS cluster. NameNode (Also behaving as Datanode1) - ...
Filbadeha's user avatar
  • 401
1 vote
1 answer
48 views

How to set up Apache Impala on Windows using Docker? [closed]

Can anyone help me with a step-by-step guide or a docker-compose.yml file that can be used to set up Apache Impala with its required services (e.g., Impala Daemon, State Store, Catalog Service, HDFS) ...
Sezer Demir's user avatar
0 votes
0 answers
38 views

java.net.ConnectException: Connection refused when trying to connect to HDFS from Spark on EMR to create an ICEBERG table

I am new to spark and I'm working with a Spark job on an AWS EMR cluster using jupyter notebook. I'm trying to interact with HDFS. Specifically, I am trying to create an Apache Iceberg table. However, ...
shady xv's user avatar
0 votes
1 answer
92 views

Missing PutHDFS Processor in Apache NiFi 2.0.0

I'm using Apache NiFi 2.0.0, which unfortunately does not include the PutHDFS processor. My project requires this version of NiFi due to its integration capabilities with Python scripting, so ...
Filbadeha's user avatar
  • 401
0 votes
0 answers
42 views

Apache NiFi: PutHDFS Processor - replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded

I am using Apache NiFi version 1.28. I am trying to create a minimal data flow where I generate data and ingest it into HDFS on HDP (Hortonworks Data Platform) 2.5.0, and I am getting the ...
Filbadeha's user avatar
  • 401
0 votes
1 answer
56 views

Apache NiFi: PutHDFS Processor issue - PutHDFS Failed to write to HDFS java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configurable

I am using Apache NiFi version 1.28. I am trying to create a minimal data flow where I generate data and ingest it into HDFS on HDP (Hortonworks Data Platform) 2.5.0, and I am getting the ...
Filbadeha's user avatar
  • 401
0 votes
0 answers
33 views

Incompatible configuration used between Spark and HBaseTestingUtility

We are using the MiniDFSCluster and MiniHbaseCluster from HBaseTestingUtility to run unit tests for our Spark jobs. The Spark configuration that we use is : conf.set("spark.sql....
Evelina Dumitrescu's user avatar
1 vote
3 answers
170 views

Why is metadata consuming large amount of storage and how to optimize it?

I'm using PySpark with Apache Iceberg on an HDFS-based data lake, and I'm encountering significant storage issues. My application ingests real-time data every second. After approximately 2 hours, I ...
Oth Mane's user avatar
0 votes
0 answers
25 views

Unable to Move Deleted Files to Trash via Hadoop Web Interface

I have encountered an issue with the Hadoop-3.3.6 Web interface regarding file deletion. By default, when I delete files through the Hadoop Web interface, they are permanently removed and do not go to ...
leizhuokin's user avatar
-3 votes
1 answer
34 views

.gz files are unsplittable. But if I place them in HDFS, they create multiple blocks depending on block size

We all know .gz is non-splittable, which means only a single core can read it. This means that when I place a huge .gz file on HDFS, it should actually be present as a single block. I see it is getting split ...
Praveen Kumar B N's user avatar
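
A note on the question above: HDFS blocks and input splits are two different things. HDFS cuts every file into blocks by byte offset regardless of its format, while splittability determines how many tasks can read the file in parallel. The sketch below only illustrates that arithmetic. The function names are hypothetical, and it assumes the split size equals the block size, which is a simplification.

```python
def hdfs_block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Blocks HDFS uses to store a non-empty file (pure storage-level chunking)."""
    # Ceiling division with integers only, to avoid float rounding on large sizes.
    return -(-file_size_bytes // block_size_bytes)


def input_split_count(file_size_bytes: int, block_size_bytes: int, splittable: bool) -> int:
    """Parallel read tasks for a non-empty file, ASSUMING split size == block size."""
    if not splittable:
        # A gzip stream has to be decompressed from its start, so one task reads it all.
        return 1
    return hdfs_block_count(file_size_bytes, block_size_bytes)


GIB = 1024 ** 3
MIB = 1024 ** 2
print(hdfs_block_count(GIB, 128 * MIB))                     # blocks on HDFS
print(input_split_count(GIB, 128 * MIB, splittable=False))  # tasks reading a .gz
```

Under these assumptions a large .gz file is still stored as several blocks, but a single task reads all of them in sequence.
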
0 votes
1 answer
39 views

hdfs dfs -mkdir no such file or directory

I'm new to hadoop and I'm trying to create a directory in hdfs called input_dir. I have set up my vm, installed and started hadoop successfully. This is the command that I run: hdfs dfs -mkdir ...
Bloom's user avatar
  • 1
0 votes
0 answers
36 views

I set up port forwarding on port 50070 to access the Hadoop master node

$ ssh -i C:/Users/amyousufi/Desktop/private-key -L 50070:localhost:50070 [email protected] After setting up the port, I receive the following error: bind [127.0.0.1]:50070: Permission denied ...
Aminullah Yousufi's user avatar
0 votes
0 answers
29 views

Cloudera HDFS connection is failing at Kerberos login

I am using this code to connect to Cloudera HDFS: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs....
Ramakrishna rayudu's user avatar
0 votes
0 answers
44 views

Custom NEWLINE symbol in Greenplum

I want to create an external table in Greenplum like this: CREATE READABLE EXTERNAL TABLE "table" ("id" INTEGER, "name" TEXT) LOCATION ('<file location>') FORMAT 'TEXT' (...
Vadim Myakish's user avatar
0 votes
1 answer
73 views

Integrate Hadoop HDFS with Snowflake

I'm building a personal project, but I got stuck. Specifically, after writing Spark jobs to process & transform data, I load the data into Hadoop HDFS. Then I want to connect HDFS to Snowflake, then I ...
user24366017's user avatar
0 votes
0 answers
50 views

One Stage/Job in Spark application only running on one executor

I have been running some tests regarding the execution of Spark and I have noticed a weird behaviour in one of my tests. My Spark cluster consists of 1 master node and 4 worker nodes, and the ...
Junbo Du's user avatar
0 votes
0 answers
13 views

Solr - Hadoop theory of data writing

I'm wondering how the data writing actually works. In a Solr cluster that saves its indexes onto Hadoop datanodes, if I trigger a shard split, how is the data managed? Let's say that I split a ...
Roberto D. Maggi's user avatar
0 votes
0 answers
36 views

Data Ingestion through DataNodes on HDFS

I am ingesting or writing data on HDFS. I have 5 data nodes, and I wrote a PySpark script to ingest data from each data node. I am doing this because I want a replication factor of 1. However, when I ...
Hasnat saghir's user avatar
1 vote
0 answers
16 views

Combining small files into bigger ones didn't help improve storage efficiency on HDFS [closed]

On my HDFS, the block size is 512 MB. I have an offline flow which generates data daily and outputs the data to HDFS. I saw many small files (several KB) generated, so I updated the flow, making it ...
Zijiang Hao's user avatar
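
One possible explanation, offered as a sketch and not a diagnosis: an HDFS block occupies only as much disk as the data it holds, so a file of a few KB does not consume a full 512 MB block. Merging small files mainly reduces the number of blocks the NameNode has to track, not the bytes on disk. The function names below are hypothetical, a replication factor of 3 is assumed, and checksum files and local filesystem overhead are ignored.

```python
def raw_disk_usage(file_sizes, replication=3):
    """Approximate bytes on DataNode disks: actual data size times replication."""
    return sum(file_sizes) * replication


def block_count(file_sizes, block_size):
    """Blocks the NameNode tracks: each non-empty file needs at least one block."""
    return sum(-(-size // block_size) for size in file_sizes if size > 0)


KIB = 1024
BLOCK = 512 * 1024 * 1024           # 512 MB block size, as in the question

small = [4 * KIB] * 1000            # 1000 files of 4 KiB each (illustrative numbers)
merged = [4 * KIB * 1000]           # the same data in a single file

print(raw_disk_usage(small), raw_disk_usage(merged))          # identical byte counts
print(block_count(small, BLOCK), block_count(merged, BLOCK))  # many blocks vs one
```
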
0 votes
1 answer
74 views

HDFS datanode: Couldn't resolve the hostname to an IP address

I'm configuring the Virtual Machine with Hadoop ecosystem using Docker, VirtualBox, with Ubuntu 24.04. For now, I'm using a docker-compose.yaml to run multiple services, including namenode, datanode, ...
Woden's user avatar
  • 1,346
0 votes
1 answer
26 views

Configure Spark to read data from HDFS by default

I've installed HDFS and Spark. However, how can I configure Spark to read from hdfs://localhost:9000/ by default? Currently, to load a file into a Spark DataFrame, I need to write spark.read.load(&...
Trần Quang Đạt's user avatar
1 vote
1 answer
31 views

Hadoop HA with QJM: wrong installation

This is my first installation of Hadoop HA with QJM and I have been having a lot of trouble for at least a whole week. The lab is set up as follows: 10.0.0.10 zoo1 solr1 had1 10.0.0.11 zoo2 solr2 had2 10.0.0....
Roberto D. Maggi's user avatar
0 votes
0 answers
66 views

Greenplum PXF server: create external table based on Parquet HDFS file with partition in select

I have a Parquet HDFS file with partitions and I want to create an external table in Greenplum with the partition columns in it. This is the HDFS file: CREATE TABLE productshelf.funnel ( system_product_id ...
user_Dima's user avatar
0 votes
0 answers
38 views

Sqoop: error when migrating table in pgadmin4 database to HDFS

I'm a newbie to Sqoop and HDFS and I'm trying to migrate a table in pgadmin4 to HDFS using Sqoop. I spent a day fixing errors, but I finally hit this error that I can't find a solution ...
Bùi Nam Quân's user avatar
0 votes
0 answers
27 views

issues putting/copying local file to hdfs in docker

For context, I am working on a solution that requires data collection using Kafka, and then storing that data in HDFS. The producer part in Kafka is already working, but when it comes to the ...
Assou Iben jellal's user avatar
1 vote
0 answers
52 views

How to set up HDFS with GeoServer?

I want to test the performance of HDFS with geospatial data. So I set up HDFS, but I am unable to fetch the data from HDFS and give it to GeoServer. Is there any way to fetch the data from HDFS and give it ...
Sathvika M's user avatar
0 votes
0 answers
29 views

How do the HDFS journal nodes work internally?

I understand that Journal Nodes are like the central repo for all edit logs (no matter which namenode is currently active, all push to journal node). I suppose the QuorumJournalManager handles this ...
Nishant Neupane's user avatar
1 vote
0 answers
38 views

How to extract date part from folder name and move it to another folder on hdfs using pyspark

I currently have folders and sub-folders in day-wise structure in this path '/dev/data/' 2024.03.30 part-00001.avro part-00002.avro 2024.03.31 part-00001.avro part-00002.avro 2024.04.01 ...
user175025's user avatar
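
A minimal sketch of the date-parsing step only, assuming every day folder is named like `2024.03.30` as in the excerpt. The helper names, the `/dev/archive` base path and the `year=/month=/day=` target layout are hypothetical, since the question's intended destination is cut off. The actual move on HDFS is left out.

```python
from datetime import date, datetime


def parse_day_folder(folder_name: str) -> date:
    """Parse a folder name such as '2024.03.30'; raises ValueError on anything else."""
    return datetime.strptime(folder_name, "%Y.%m.%d").date()


def target_dir(base: str, folder_name: str) -> str:
    """Build a destination path from the parsed date (layout is an assumption)."""
    day = parse_day_folder(folder_name)
    return f"{base.rstrip('/')}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"


for name in ["2024.03.30", "2024.03.31", "2024.04.01"]:
    print(name, "->", target_dir("/dev/archive", name))
```

Because `strptime` raises `ValueError` for names that do not match the pattern, entries such as `part-00001.avro` can be filtered out before any move is attempted.
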
0 votes
1 answer
59 views

Hive on Tez insert data fail at org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 95

hive> insert into music values (1,'Faded'); Query ID = hadoopusr_20240625144523_fccb44c9-00e4-4a79-9d81-b5d56517a67d Total jobs = 1 Launching Job 1 out of 1 Tez session was closed. Reopening... ...
Kaung Htet Hein's user avatar
0 votes
1 answer
53 views

HiBench running error hibench.hadoop.examples.jar not found

I'm trying to use Intel's HiBench to build workloads on GCP. I successfully managed to build the Maven project and I set the configurations as follows: hadoop.conf: # Hadoop home hibench.hadoop....
Wajih101's user avatar
0 votes
0 answers
56 views

Configuration HDFS, SPARK, ZEPPELIN

Project-structure: ROOT: - hadoop - conf - core-site.xml - hdfs-site.xml - local - data - .csv files - zeppelin - conf - files .json - notebook - docker-compose core-site.xml &...
Fruitcake_Gary's user avatar
0 votes
1 answer
21 views

How to shift operation to the Secondary name node?

I have a three-node HDFS cluster with a Name Node/Datanode, a Secondary Name Node/Datanode, and a Datanode. My primary Name Node burned in a fire, but the other two are just fine. How do I shift to ...
Robert Rapplean's user avatar
0 votes
0 answers
30 views

CalledProcessError thrown when executing HDFS cp command through subprocess.check_output

I encountered a CalledProcessError when I ran the HDFS cp command using the subprocess.check_output function. Below is a sample of my program. >>import subprocess >>command = "hdfs ...
Valli69's user avatar
  • 9,852
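
For reference, `subprocess.check_output` raises `CalledProcessError` whenever the child process exits with a non-zero status, and the exception carries the exit code and any captured output. The sketch below uses a hypothetical helper named `run_command` to show one way to surface that information. It is not taken from the question's code, which is truncated above.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list; return (exit_code, combined stdout and stderr)."""
    try:
        # stderr is merged into stdout so the failure reason is not lost.
        output = subprocess.check_output(args, stderr=subprocess.STDOUT, text=True)
        return 0, output
    except subprocess.CalledProcessError as err:
        # err.output holds whatever the command printed before it failed.
        return err.returncode, err.output


if __name__ == "__main__":
    # Stand-in for a failing command: a Python child process that exits with status 3.
    print(run_command([sys.executable, "-c", "raise SystemExit(3)"]))
```

A hypothetical call for the HDFS case would look like `run_command(["hdfs", "dfs", "-cp", src, dst])`, where `src` and `dst` are placeholders. Passing the command as a list avoids shell quoting issues.
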
0 votes
0 answers
37 views

Error Starting Hadoop Services - JAVA_HOME is not set and could not be found

Brief Description of the Issue: When attempting to start the Hadoop services within a Docker container, the Java Home (JAVA_HOME) is not being correctly found, resulting in errors when running the ...
Felipe's user avatar
  • 49
0 votes
0 answers
21 views

How to migrate a large HDFS file (>12GB) from source to destination server without changing "rpc-max-received-response-size"?

I am working on migrating HDFS files from the old server to the new one. I tried to use "hadoop dist" and "hdfs archive" to transfer a large HDFS file from source to ...
Leon's user avatar
  • 468
0 votes
0 answers
98 views

docker - hdfs to hive issue

I have configured three containers that are networked because I intend to utilise Hadoop with Hive. You can access the Docker setup via https://github.com/jcool12/hadoophive/tree/main/hadoophive to ...
jr134's user avatar
  • 83
0 votes
1 answer
34 views

how to check which HDFS datanode ip is returned by namenode to spark?

If I'm reading/writing a dataframe in PySpark specifying HDFS name node hostname and port: df.write.parquet("hdfs://namenode:8020/test/go", mode="overwrite") Is there any way to ...
John Black's user avatar
0 votes
1 answer
25 views

Impact of HDFS replication factor on namenode memory

Does increasing the replication factor increase NameNode memory usage in HDFS? This link states that the replication factor does not affect NameNode memory usage, but another link states otherwise.
Dan's user avatar
  • 1,464
0 votes
1 answer
64 views

decommissioning data nodes in hdfs

I have some data nodes in Apache HDFS with replication factor 1. I want to decommission some of them without losing the data stored on them. Because of the volume of data, I can't download data ...
Mohammad Miri's user avatar
0 votes
0 answers
31 views

RPC response has invalid length of -16777216

I am trying to set up a single-node Hadoop cluster, but it seems that I can't run any command. No matter what I type, it always replies like this: user@MSI:~/Downloads/hadoop-3.4.0$ bin/hdfs dfs -...
Typedef's user avatar
0 votes
1 answer
129 views

PutHDFS NiFi issue

Good morning, I want to create a NiFi flow from a certain URL to my HDFS. I created my HDFS cluster locally with my own build and Dockerfile and it is working, but when I try to use the ...
Numero8's user avatar
  • 13
0 votes
1 answer
53 views

Overwrite a hive table without downtime

I have a hive table which is associated with an HDFS path. The table is overwritten by a periodic job and has a few downstream consumers. The table gets dropped while being overwritten and if a ...
Utkarsh Roy's user avatar
0 votes
1 answer
78 views

How to read a db.properties file or any other conf file in the project when deployed

How can I read a db.properties file or any other conf file in the project when deployed in production? I'm getting this error... 24/05/09 16:34:32 INFO Client: client token: N/A ...
Harsh Agrawal's user avatar
0 votes
0 answers
14 views

Why are HBase StoreFileSize (from JMX) and hdfs dfs -du -h different?

I want to get the HBase table StoreFileSize, but the result of the command curl hostip:port/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Tables is different from hdfs dfs -du -h /hbase/data/default/...
ray's user avatar
  • 43
0 votes
1 answer
103 views

Spark read parquet files based on multiple partitions i.e., on DATE_KEY and BASE_FEED

I'm using PySpark to read Parquet files from an HDFS location partitioned by DATE_KEY. The following code always reads the file from the MAX(DATE_KEY) partition and converts it to a Polars dataframe. def ...
Balaji Venkatachalam's user avatar
