8,296 questions
0
votes
0
answers
14
views
Error: `callbackHandler` may not be null when connecting to HDFS using Kerberos in Jakarta EE
I am trying to connect to HDFS using Kerberos authentication in a Jakarta EE application. The connection code appears to be set up correctly, but I am encountering the following error when attempting ...
1
vote
1
answer
35
views
Spark and HDFS cluster running on docker
I'm trying to set up a Spark application running on my local machine to connect to an HDFS cluster where the NameNode is running inside a Docker container.
Here are the relevant details of my setup:
...
1
vote
0
answers
20
views
How to Prevent Batch Jobs from Accessing Streaming Job Outputs While Writing to HDFS?
I'm developing a clickstream project that collects user events and stores them within HDFS.
You can see the project architecture on the diagram:
1. The Collector collects events via HTTP API and ...
0
votes
0
answers
25
views
How to properly output a path containing non-English characters with Popen
This is my function code:
~~~~sql
DO
$$
from subprocess import PIPE,Popen
import os
import sys
env = os.environ.copy()
reload(sys)
sys.setdefaultencoding('utf-8')
plpy.notice(sys....
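A minimal sketch of one way to approach this, assuming Python 2 (as in the plpython snippet above): force a UTF-8 locale in the child's environment and pass the path to Popen as UTF-8 bytes. The directory name and the `ls` command below are hypothetical examples, not the asker's code.
~~~~python
# -*- coding: utf-8 -*-
# Hedged sketch (Python 2): run an external command on a path containing
# non-English characters by encoding the path to UTF-8 bytes and exporting a
# UTF-8 locale to the child process. The path is a made-up example.
import os
from subprocess import PIPE, Popen

env = os.environ.copy()
env['LANG'] = 'en_US.UTF-8'
env['LC_ALL'] = 'en_US.UTF-8'

path = u'/data/\u4e2d\u6587\u76ee\u5f55'            # hypothetical non-English path
proc = Popen(['ls', '-l', path.encode('utf-8')],
             stdout=PIPE, stderr=PIPE, env=env)
out, err = proc.communicate()
print(out.decode('utf-8'))
~~~~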
0
votes
0
answers
36
views
Data Retrieval from HDFS
I am trying to retrieve data from a table in HDFS in which a column contains timestamps.
I am connected to HDFS using CDSW and running a Python script which opens a Spark session and runs a SQL query ...
0
votes
0
answers
16
views
Hadoop INIT_FAILD: Data is not Replicated on DataNode
I am using Hadoop 3.2.4 as a standalone service on Windows 11 23H2. I am trying to ingest data from Apache NiFi into a two-node Hadoop HDFS cluster.
NameNode (Also behaving as Datanode1) - ...
1
vote
1
answer
48
views
How to set up Apache Impala on Windows using Docker? [closed]
Can anyone help me with a step-by-step guide or a docker-compose.yml file that can be used to set up Apache Impala with its required services (e.g., Impala Daemon, State Store, Catalog Service, HDFS) ...
0
votes
0
answers
38
views
java.net.ConnectException: Connection refused when trying to connect to HDFS from Spark on EMR to create an ICEBERG table
I am new to Spark and I'm working with a Spark job on an AWS EMR cluster using a Jupyter notebook. I'm trying to interact with HDFS. Specifically, I am trying to create an Apache Iceberg table. However, ...
0
votes
1
answer
92
views
Missing PutHDFS Processor in Apache NiFi 2.0.0
I'm using Apache NiFi 2.0.0, which unfortunately does not include the PutHDFS processor. My project requires this version of NiFi due to its integration capabilities with Python scripting, so ...
0
votes
0
answers
42
views
Apache NiFi: PutHDFS Processor - replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded
I am using Apache NiFi version 1.28. I am trying to create a minimal data flow where I generate data and want to ingest it into HDFS on HDP (Hortonworks Data Platform) 2.5.0. I am getting the ...
0
votes
1
answer
56
views
Apache Nifi: PutHDFS Processor issue - PutHDFS Failed to write to HDFS java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configurable
I am using Apache NiFi version 1.28. I am trying to create a minimal data flow where I generate data and want to ingest it into HDFS on HDP (Hortonworks Data Platform) 2.5.0. I am getting the ...
0
votes
0
answers
33
views
Incompatible configuration used between Spark and HBaseTestingUtility
We are using the MiniDFSCluster and MiniHbaseCluster from HBaseTestingUtility to run unit tests for our Spark jobs.
The Spark configuration that we use is :
conf.set("spark.sql....
1
vote
3
answers
170
views
Why is metadata consuming large amount of storage and how to optimize it?
I'm using PySpark with Apache Iceberg on an HDFS-based data lake, and I'm encountering significant storage issues. My application ingests real-time data every second. After approximately 2 hours, I ...
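Not an answer from the thread, just a hedged sketch of the usual mitigation: Iceberg writes new metadata and manifest files on every commit, so per-second ingestion accumulates metadata quickly, and expiring old snapshots lets that space be reclaimed. The catalog and table names below ("my_catalog", "db.events") and the retention window are assumptions, and `spark` is assumed to be a SparkSession configured with the Iceberg extensions.
~~~~python
# Hedged sketch: expire old Iceberg snapshots so their metadata and data files
# can be cleaned up. Catalog/table names and the retention window are made up.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 10
    )
""")
~~~~
Compacting small data files with Iceberg's `rewrite_data_files` procedure and batching commits less frequently are the other common levers for this kind of metadata growth.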
0
votes
0
answers
25
views
Unable to Move Deleted Files to Trash via Hadoop Web Interface
I have encountered an issue with the Hadoop-3.3.6 Web interface regarding file deletion. By default, when I delete files through the Hadoop Web interface, they are permanently removed and do not go to ...
-3
votes
1
answer
34
views
.gz files are unsplittable. But if I place them in HDFS, they create multiple blocks depending on block size
We all know .gz is non-splittable, which means only a single core can read it. So when I place a huge .gz file on HDFS, I would expect it to be present as a single block, yet I see it is getting split ...
0
votes
1
answer
39
views
hdfs dfs -mkdir no such file or directory
I'm new to Hadoop and I'm trying to create a directory in HDFS called input_dir. I have set up my VM and installed and started Hadoop successfully.
This is the command that I run:
hdfs dfs -mkdir ...
0
votes
0
answers
36
views
I set up port forwarding on port 50070 to access the Hadoop master node
$ ssh -i C:/Users/amyousufi/Desktop/private-key -L 50070:localhost:50070 [email protected]
After setting up the port forwarding, I receive the following error:
bind [127.0.0.1]:50070: Permission denied
...
0
votes
0
answers
29
views
Cloudera HDFS connection is failing at Kerberos login
I am using this code to connect to Cloudera HDFS:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs....
0
votes
0
answers
44
views
Custom NEWLINE symbol in Greenplum
I want to create an external table in Greenplum like this:
CREATE READABLE EXTERNAL TABLE "table"
("id" INTEGER, "name" TEXT)
LOCATION ('<file location>')
FORMAT 'TEXT' (...
0
votes
1
answer
73
views
Integrate Hadoop HDFS with Snowflake
I'm building a personal project, but I got stuck.
Specifically, after writing Spark jobs to process and transform data, I load the data into Hadoop HDFS. Then I want to connect HDFS to Snowflake, then I ...
0
votes
0
answers
50
views
One Stage/Job in Spark application only running on one executor
I have been running some tests regarding the execution of Spark and I have noticed a weird behaviour in one of my tests. My Spark cluster consists of 1 master node and 4 worker nodes, and the ...
0
votes
0
answers
13
views
Solr - Hadoop theory of data writing
I'm wondering how data writing actually works.
In a Solr cluster that saves its indexes onto Hadoop datanodes, if I trigger a shard split, how is the data managed?
Let's say that I split a ...
0
votes
0
answers
36
views
Data Ingestion through DataNodes on HDFS
I am ingesting or writing data on HDFS. I have 5 data nodes, and I wrote a PySpark script to ingest data from each data node. I am doing this because I want a replication factor of 1. However, when I ...
1
vote
0
answers
16
views
Combining small files into bigger ones didn't help improve storage efficiency on HDFS [closed]
On my HDFS, the block size is 512 MB.
I have an offline flow which generates data daily and outputs the data to HDFS.
I saw many small files (several KB) generated, so I updated the flow, making it ...
0
votes
1
answer
74
views
HDFS datanode: Couldn't resolve the hostname to an IP address
I'm configuring the Virtual Machine with Hadoop ecosystem using Docker, VirtualBox, with Ubuntu 24.04. For now, I'm using a docker-compose.yaml to run multiple services, including namenode, datanode, ...
0
votes
1
answer
26
views
Configure Spark to read from HDFS by default
I've installed HDFS and Spark. However, how can I configure Spark to read from hdfs://localhost:9000/ by default? Currently, to load a file into a Spark DataFrame, I need to write spark.read.load("...
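A minimal sketch of one way to do this, assuming the namenode really listens on localhost:9000 as in the question: pass `fs.defaultFS` to Spark's Hadoop configuration so scheme-less paths resolve against HDFS. The same property can instead go into a core-site.xml on Spark's classpath; the example path below is hypothetical.
~~~~python
# Hedged sketch: make scheme-less paths resolve on HDFS instead of the local
# filesystem. Host/port and the example path are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-default-fs")
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000")
    .getOrCreate()
)

# Now "/data/example.parquet" is read from hdfs://localhost:9000/data/example.parquet.
df = spark.read.parquet("/data/example.parquet")
~~~~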
1
vote
1
answer
31
views
Hadoop HA with QJM: wrong installation
This is my first installation of Hadoop HA with QJM and I've been having a lot of trouble for at least a whole week.
The lab is set up as follows:
10.0.0.10 zoo1 solr1 had1
10.0.0.11 zoo2 solr2 had2
10.0.0....
0
votes
0
answers
66
views
Greenplum PXF server: create external table based on partitioned Parquet HDFS file with partition columns in SELECT
I have a partitioned Parquet file on HDFS and I want to create an external table in Greenplum with the partition columns in it.
This is the HDFS file:
CREATE TABLE productshelf.funnel (
system_product_id ...
0
votes
0
answers
38
views
Sqoop: error when migrating a table from a pgAdmin4 database to HDFS
I'm a newbie to Sqoop and HDFS and I'm trying to migrate a table from pgAdmin4 to HDFS using Sqoop. I spent a day fixing errors, but finally I encountered this error that I can't find a solution ...
0
votes
0
answers
27
views
Issues putting/copying a local file to HDFS in Docker
To put this in context, I am working on a solution that requires collecting data using Kafka and then storing that data in HDFS. The Kafka producer part is already working, but when it comes to the ...
1
vote
0
answers
52
views
How to setup HDFS with Geoserver?
I want to test the performance of HDFS with geospatial data, so I set up HDFS, but I am unable to fetch the data from HDFS and give it to GeoServer. Is there any way to fetch the data from HDFS and give it ...
0
votes
0
answers
29
views
How does the hdfs journal nodes work internally?
I understand that Journal Nodes are like the central repository for all edit logs (no matter which namenode is currently active, all push to the journal nodes). I suppose the QuorumJournalManager handles this ...
1
vote
0
answers
38
views
How to extract date part from folder name and move it to another folder on hdfs using pyspark
I currently have folders and sub-folders in a day-wise structure under the path '/dev/data/':
2024.03.30
    part-00001.avro
    part-00002.avro
2024.03.31
    part-00001.avro
    part-00002.avro
2024.04.01
...
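A hedged sketch of one approach, using Hadoop's FileSystem API through PySpark's JVM gateway rather than a distributed Spark job; the destination root and the renamed layout below are assumptions, not part of the question.
~~~~python
# Hedged sketch: parse the date out of each "YYYY.MM.DD" folder under /dev/data
# and rename (move) the folder under a hypothetical destination root.
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("move-dated-folders").getOrCreate()

jvm = spark.sparkContext._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
Path = jvm.org.apache.hadoop.fs.Path

src_root = "/dev/data"
dst_root = "/dev/archive"                      # hypothetical destination

for status in fs.listStatus(Path(src_root)):
    name = status.getPath().getName()          # e.g. "2024.03.30"
    m = re.match(r"(\d{4})\.(\d{2})\.(\d{2})$", name)
    if m:
        year, month, day = m.groups()
        dst = Path("{}/{}-{}-{}".format(dst_root, year, month, day))
        fs.mkdirs(dst.getParent())
        fs.rename(status.getPath(), dst)       # moves the whole day folder
~~~~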
0
votes
1
answer
59
views
Hive on Tez insert data fail at org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 95
hive> insert into music values (1,'Faded');
Query ID = hadoopusr_20240625144523_fccb44c9-00e4-4a79-9d81-b5d56517a67d
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
...
0
votes
1
answer
53
views
HiBench running error hibench.hadoop.examples.jar not found
I'm trying to use Intel's HiBench to build workloads on GCP. I successfully managed to build the Maven project and I set the configurations as follows:
hadoop.conf:
# Hadoop home
hibench.hadoop....
0
votes
0
answers
56
views
Configuration HDFS, SPARK, ZEPPELIN
Project-structure:
ROOT:
- hadoop
  - conf
    - core-site.xml
    - hdfs-site.xml
- local
  - data
    - .csv files
- zeppelin
  - conf
    - files .json
  - notebook
- docker-compose
core-site.xml
&...
0
votes
1
answer
21
views
How to shift operation to the Secondary NameNode?
I have a three-node HDFS cluster with a Name Node/Datanode, a Secondary Name Node/Datanode, and a Datanode.
My primary Name Node burned in a fire, but the other two are just fine. How do I shift to ...
0
votes
0
answers
30
views
CalledProcessError thrown when executing HDFS cp command through subprocess.check_output
I encountered a CalledProcessError when I ran the HDFS cp command using the subprocess.check_output function. Below is a sample of my program.
>>> import subprocess
>>> command = "hdfs ...
0
votes
0
answers
37
views
Error Starting Hadoop Services - JAVA_HOME is not set and could not be found
Brief Description of the Issue:
When attempting to start the Hadoop services within a Docker container, the Java Home (JAVA_HOME) is not being correctly found, resulting in errors when running the ...
0
votes
0
answers
21
views
How to migrate a large HDFS file (>12GB) from a source to a destination server without changing "rpc-max-received-response-size"?
I am working on migrating HDFS files from the old server to the new one.
I tried to use "hadoop dist" and "hdfs archive" to transfer a large HDFS file from the source to ...
0
votes
0
answers
98
views
docker - hdfs to hive issue
I have configured three containers that are networked because I intend to utilise Hadoop with Hive. You can access the Docker setup via https://github.com/jcool12/hadoophive/tree/main/hadoophive to ...
0
votes
1
answer
34
views
How to check which HDFS datanode IP is returned by the namenode to Spark?
If I'm reading/writing a dataframe in PySpark, specifying the HDFS namenode hostname and port:
df.write.parquet("hdfs://namenode:8020/test/go", mode="overwrite")
Is there any way to ...
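One hedged way to see the candidate datanodes for a path (not necessarily the exact replica Spark picked for a particular read) is `hdfs fsck` with block locations; the path below matches the question's example, and wrapping it in subprocess is just for illustration.
~~~~python
# Hedged sketch: list, per block of the file, the datanode addresses holding replicas.
import subprocess

report = subprocess.check_output(
    ["hdfs", "fsck", "/test/go", "-files", "-blocks", "-locations"]
)
print(report.decode("utf-8"))
~~~~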
0
votes
1
answer
25
views
Impact of HDFS replication factor on namenode memory
Does increasing the replication factor increase namenode memory usage in HDFS?
This link states the replication factor has no impact on namenode memory usage, but another link states otherwise.
0
votes
1
answer
64
views
Decommissioning data nodes in HDFS
I have some data nodes in Apache HDFS with replication factor 1 and want to decommission some of them without losing the data stored on them.
Because of the volume of data, I can't download the data ...
0
votes
0
answers
31
views
RPC response has invalid length of -16777216
I am trying to set up a single-node Hadoop cluster.
But it seems that I can't run any command.
No matter what I type, it always replies back like this:
user@MSI:~/Downloads/hadoop-3.4.0$ bin/hdfs dfs -...
0
votes
1
answer
129
views
PutHDFS NiFi issue
Good morning, I want to create a NiFi flow from a certain URL to my HDFS. I created my HDFS cluster locally with my own build and Dockerfile and it is working, but when I try to use the ...
0
votes
1
answer
53
views
Overwrite a hive table without downtime
I have a hive table which is associated with an HDFS path. The table is overwritten by a periodic job and has a few downstream consumers. The table gets dropped while being overwritten and if a ...
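A hedged sketch of one common pattern (not necessarily what the answer suggested): write the new data to a fresh HDFS directory, then repoint the table with `ALTER TABLE ... SET LOCATION` in a single metadata operation, so readers never see a dropped or half-written table. The table name, paths, and the assumption of an existing SparkSession `spark` and DataFrame `df` are all hypothetical.
~~~~python
# Hedged sketch: stage the new snapshot in a new directory, then swap the table's
# location in one metadata operation. Names and paths are made up.
staging_path = "hdfs://namenode:8020/warehouse/my_table/snapshot_20240601"

(df.write
   .mode("overwrite")
   .parquet(staging_path))

spark.sql("ALTER TABLE my_db.my_table SET LOCATION '{}'".format(staging_path))
~~~~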
0
votes
1
answer
78
views
how to read a db.properties file or any other conf file in the project when deployed
how to read a db.properties file or any other conf file in the project when deployed in production?
I'm getting this error...
24/05/09 16:34:32 INFO Client:
client token: N/A
...
0
votes
0
answers
14
views
Why are HBase StoreFileSize (from JMX) and hdfs dfs -du -h different?
I want to get the HBase table StoreFileSize,
but the result of the command curl hostip:port/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Tables is different from hdfs dfs -du -h /hbase/data/default/...
0
votes
1
answer
103
views
Spark read parquet files based on multiple partitions i.e., on DATE_KEY and BASE_FEED
I'm using PySpark to read Parquet files from an HDFS location partitioned by DATE_KEY. The following code always reads the file from the MAX(DATE_KEY) partition and converts it to a Polars dataframe.
def ...
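A hedged sketch of reading on both partition keys, assuming a layout like `.../DATE_KEY=<date>/BASE_FEED=<feed>/` under a hypothetical root path: `basePath` keeps both partition columns in the schema, and the filters prune the scan to just the matching partitions.
~~~~python
# Hedged sketch: filter on both partition columns instead of always taking MAX(DATE_KEY).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("multi-partition-read").getOrCreate()

root = "hdfs://namenode:8020/data/events"      # hypothetical root path

df = (
    spark.read
    .option("basePath", root)
    .parquet(root)
    .filter((col("DATE_KEY") == "2024-06-01") & (col("BASE_FEED") == "clickstream"))
)
~~~~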