All Questions
372 questions
0
votes
0
answers
25
views
How to correctly output a path containing non-English characters with Popen
This is my function code:
~~~~sql
DO
$$
from subprocess import PIPE,Popen
import os
import sys
env = os.environ.copy()
reload(sys)
sys.setdefaultencoding('utf-8')
plpy.notice(sys....
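~~~~

The excerpt uses Python 2 idioms (reload(sys), sys.setdefaultencoding) that don't exist in Python 3. A minimal Python 3 sketch of the usual fix, passing the path as a str and setting encoding on Popen so the pipes are decoded; the path and command below are hypothetical:

~~~~python
from subprocess import PIPE, Popen

path = "/данные/テスト"  # hypothetical non-English path; a str in Python 3
# encoding="utf-8" (Python 3.6+) makes stdout/stderr text streams
proc = Popen(["ls", "-ld", path], stdout=PIPE, stderr=PIPE, encoding="utf-8")
out, err = proc.communicate()
print(out)
~~~~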
0
votes
0
answers
27
views
Issues putting/copying a local file to HDFS in Docker
To give some context: I am working on a solution that requires collecting data with Kafka and then storing that data in HDFS. The Kafka producer part is already working, but when it comes to the ...
1
vote
0
answers
38
views
How to extract date part from folder name and move it to another folder on hdfs using pyspark
I currently have folders and sub-folders in a day-wise structure under the path '/dev/data/':
2024.03.30
    part-00001.avro
    part-00002.avro
2024.03.31
    part-00001.avro
    part-00002.avro
2024.04.01
...
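A sketch of one approach driving the HDFS shell from Python; the source path matches the question, while the target path and layout are assumptions:

~~~~python
import re
import subprocess

SRC = "/dev/data"
DST = "/dev/archive"  # hypothetical target directory

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", DST], check=True)
ls = subprocess.run(["hdfs", "dfs", "-ls", SRC],
                    capture_output=True, text=True, check=True)
for line in ls.stdout.splitlines():
    fields = line.split()
    if not fields:
        continue
    path = fields[-1]                 # last field of `-ls` output is the path
    name = path.rsplit("/", 1)[-1]
    if re.fullmatch(r"\d{4}\.\d{2}\.\d{2}", name):   # e.g. 2024.03.30
        # move the whole dated folder under the target directory
        subprocess.run(["hdfs", "dfs", "-mv", path, f"{DST}/{name}"], check=True)
~~~~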
1
vote
0
answers
534
views
Spark executes "Listing leaf files and directories" while reading a partitioned dataset on HDFS
I have a dataset on HDFS, and it's partitioned on three columns: ym, ymd, and eventName. When I read the dataset using the root path, Spark creates jobs to list all leaf files in the directory. The problem is ...
0
votes
0
answers
60
views
How can I read a pickle file in HDFS?
I have developed a deep learning model using PyTorch and saved it as a pickle file.
Now, I am working on HDFS and I have a table named XX. I want to use my deep learning model to get a result ...
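A minimal sketch using the hdfs library's WebHDFS client; the namenode URL, user, and model path are assumptions:

~~~~python
import pickle
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")
# read() is a context manager yielding a file-like object
with client.read("/models/model.pkl") as reader:
    model = pickle.loads(reader.read())
~~~~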
0
votes
0
answers
138
views
ClickHouse Server Exception: Code: 210.DB::Exception: Fail to read from HDFS:
I'm trying to migrate data from HDFS to ClickHouse. Sometimes the showcase is assembled without problems, but most often it fails with this error. I tried to wrap it in try/except so that ...
0
votes
0
answers
64
views
How do I directly read files from HDFS using dask?
I have a Dask script that converts a sas7bdat file, using dask-yarn to deploy to a YARN cluster and dask_sas_reader for the conversion. dask_sas_reader depends on pyreadstat in ...
0
votes
0
answers
43
views
hdfs library will not load in an HDInsight Jupyter notebook
I have a problem in the HDInsight Jupyter Notebook.
I cannot access outside files. I am trying to access files on the HDInsight cluster head node which I can ssh to using my username and password from ...
0
votes
0
answers
93
views
PySpark dataframe created using JDBC throws a `year out of range` error when written to a Parquet file in HDFS
I am trying to do data ingestion from Oracle to Hive DB (HDFS) using PySpark.
The source table has a data row like this:
LOAD_TIMESTAMP| OPERATION| COMMIT_TIMESTAMP|KSTOKEY| LPOLNUM|LSYSTYP| XPOLVSN| ...
0
votes
0
answers
26
views
Need to send incremental data to HDFS or a Hadoop cluster; how can I achieve this?
I'm generating system data like cpu_usage, available memory, total memory, used memory, current time, system name, IP address, etc. for my testing. I'm able to achieve this when trying to send the ...
0
votes
2
answers
231
views
Accessing HDFS Data with Optimized Locality
I am wondering how to make sure HDFS data access makes the best use of local replicas to minimize network transfer.
I am hosting HDFS on 3 machines and replication is set to 3. ...
0
votes
3
answers
543
views
How can I read HDF5 files stored as 1-D arrays and view them as images?
I have a large image classification dataset stored in the .hdf5 format. The dataset has the labels and the images stored in the .hdf5 file. I am unable to view the images as they are stored in the form of ...
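A sketch assuming the HDF5 file has an 'images' dataset of flattened rows with known image dimensions; the file name, dataset name, and shape are assumptions:

~~~~python
import h5py
import matplotlib.pyplot as plt

with h5py.File("dataset.hdf5", "r") as f:
    flat = f["images"][0]              # one image as a 1-D array
    img = flat.reshape(32, 32, 3)      # restore (H, W, C); adjust to your data
plt.imshow(img)
plt.show()
~~~~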
0
votes
1
answer
579
views
Pyarrow read remote host hdfs file
I followed the tutorial and guide from the PyArrow docs, but I still can't correctly use the HDFS file system to get a file from my remote host.
Prerequisite: https://arrow.apache.org/docs/11.0/python/filesystems.html#...
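A minimal sketch with the pyarrow.fs API; host, port, user, and the file path are assumptions, and a local libhdfs / Hadoop client setup is required:

~~~~python
from pyarrow import fs
import pyarrow.parquet as pq

hdfs = fs.HadoopFileSystem(host="namenode", port=8020, user="hadoop")
table = pq.read_table("/data/file.parquet", filesystem=hdfs)
~~~~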
0
votes
3
answers
251
views
Read range of parquet files in PySpark
I have a ton of daily files stored in an HDFS where the partitions are stored in YYYY-MM-DD format.
For example:
$ hdfs dfs -ls /my/path/here
<some stuff here> /my/path/here/cutoff_date=2023-10-...
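One approach is to read the base path and filter on the partition column, relying on partition pruning so only the matching directories are scanned. A sketch; the date bounds are assumptions modeled on the question's layout:

~~~~python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# the cutoff_date=YYYY-MM-DD directories become a partition column
df = (spark.read.parquet("/my/path/here")
      .where(F.col("cutoff_date").between("2023-10-01", "2023-10-31")))
~~~~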
0
votes
1
answer
98
views
Can't achieve desired directory structure when writing data from Kafka topic to HDFS using PySpark
I am reading data from a Kafka topic and storing the data to HDFS using PySpark. I am trying to achieve the following directory structure in HDFS when I run the code:
user/hadoop/OUTPUT_FINAL1/year/...
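A minimal batch-write sketch: with partitionBy, Spark creates year=/month=/day= subdirectories under the output path. The column names and sample data are assumptions:

~~~~python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = (spark.createDataFrame([("a", "2024-01-15 10:00:00")], ["value", "ts"])
      .withColumn("ts", F.to_timestamp("ts")))
out = (df.withColumn("year", F.year("ts"))
         .withColumn("month", F.month("ts"))
         .withColumn("day", F.dayofmonth("ts")))
(out.write.mode("append")
    .partitionBy("year", "month", "day")
    .parquet("hdfs:///user/hadoop/OUTPUT_FINAL1"))
~~~~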
0
votes
1
answer
411
views
Cannot import Python hdfs client or config modules from hdfscli
I'm trying to create a Python 3 HDFS client using the hdfscli library. Per the documentation, I've tried
import hdfs
client = hdfs.config.Config().get_client("dev")
which yields ...
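Config is also exposed at the package top level; a minimal sketch assuming a 'dev' alias is configured in ~/.hdfscli.cfg:

~~~~python
from hdfs import Config

client = Config().get_client("dev")
print(client.list("/"))
~~~~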
0
votes
0
answers
381
views
How can I zip files in a HDFS directory using PySpark or shell?
For example, I want to turn this
unzipped/
├── file1.csv
├── file2.csv
└── file3.csv
into this
unzipped.zip
├── file1.csv
├── file2.csv
└── file3.csv
Using ZipFile as below I get the ...
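HDFS has no native zip operation, so a common workaround is to copy the directory down, zip it locally, and put the archive back. A sketch; all paths are assumptions:

~~~~python
import subprocess
import zipfile
from pathlib import Path

subprocess.run(["hdfs", "dfs", "-get", "/data/unzipped", "/tmp/unzipped"], check=True)
with zipfile.ZipFile("/tmp/unzipped.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for p in Path("/tmp/unzipped").glob("*.csv"):
        zf.write(p, arcname=p.name)   # store file1.csv etc. at the archive root
subprocess.run(["hdfs", "dfs", "-put", "-f", "/tmp/unzipped.zip", "/data/"], check=True)
~~~~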
0
votes
0
answers
109
views
ConnectionError when connecting my Python code on Windows to my HDFS on Linux
I am new to using HDFS. I have stored a couple of datasets there. I have working Python code on my local machine, and I want to connect to the HDFS instance on the Linux machine from the code itself.
I ...
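A minimal sketch connecting from Windows to a remote WebHDFS endpoint; the hostname, port (9870 on Hadoop 3.x, 50070 on 2.x), and user are assumptions:

~~~~python
from hdfs import InsecureClient

client = InsecureClient("http://linux-host:9870", user="hadoop")
print(client.list("/datasets"))
~~~~

Note that WebHDFS redirects reads and writes to datanode hostnames, so those hostnames must also be resolvable from the Windows machine.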
0
votes
1
answer
478
views
Connection error when trying to download file from Hadoop Namenode using Python and Docker Compose
I have my API running on a docker container. Since it uses several services, I wrapped them all in a docker-compose.yml:
version: "2.12"
services:
api:
container_name: api
...
0
votes
1
answer
106
views
How to write a Hive partitioned table with PySpark, skipping equal data columns?
In my project I use Hadoop/Hive with PySpark.
My table was created by this query:
CREATE TABLE target_db.target_table(
    id string)
PARTITIONED BY (
    user_name string,
    category ...
0
votes
1
answer
194
views
How to count the number of files and their size on a cluster?
How to count the number of files and their size on a common cluster if the files are created by different users? That is, one user created 10 files, the other 20, the size of the first is 2 GB, the ...
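A sketch using `hdfs dfs -count`, whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME; counting per user home directory is an assumption:

~~~~python
import subprocess

out = subprocess.run(["hdfs", "dfs", "-count", "/user/*"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    dirs, files, size, path = line.split(None, 3)
    print(f"{path}: {files} files, {int(size) / 2**30:.2f} GiB")
~~~~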
0
votes
1
answer
58
views
Changing a node in hdfs
We have this code uploading a config file to HDFS:
from hdfs import InsecureClient
def upload_file_to_hdfs(local_path, remote_path):
    client = InsecureClient(url='http://hdfs_server:50070', user='dr....
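A completed version of the helper as a sketch; the URL is from the question, while the user (truncated above) is left as a placeholder:

~~~~python
from hdfs import InsecureClient

def upload_file_to_hdfs(local_path, remote_path):
    client = InsecureClient(url="http://hdfs_server:50070", user="...")  # user truncated in the question
    # upload() copies the local file to the remote path, replacing any
    # existing file when overwrite=True
    client.upload(remote_path, local_path, overwrite=True)
~~~~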
1
vote
1
answer
207
views
spark addJar from hdfs in a jupyter python notebook
We are running a Jupyter notebook connected to an HDFS & Spark cluster. Some users need a jar library for a use case that we don't want to deploy for all notebooks. So we don't want to add this ...
0
votes
1
answer
624
views
PySpark 3.3.0 dataframe shows data but writing CSV creates an empty file
Facing a very unusual issue. The dataframe shows data when I run df.show(); however, when trying to write it as CSV, the operation completes without error but writes a 0-byte empty file.
Is this a bug? Is there ...
0
votes
1
answer
334
views
Module/Package resolution in Python
So I have a project directory "dataplatform", and its contents are as follows:
── dataplatform
├── __init__.py
├── commons
│ ├── __init__.py
│ ├── ...
1
vote
1
answer
241
views
Write multiple Avro files from pyspark to the same directory
I'm trying to write a PySpark dataframe out as Avro files to the path /my/path/ on HDFS, partitioned by the column 'partition', so under /my/path/ there should be the following sub ...
0
votes
2
answers
407
views
Merge multiple CSV files present in Hadoop into one CSV file locally
I have multiple CSV files in a Hadoop folder. Each CSV file has a header, and the header is the same in each file.
I am writing these csv files with the help of ...
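A sketch using `hdfs dfs -getmerge` and then dropping the duplicated headers locally; the paths are assumptions, and the filter assumes no data row is identical to the header line:

~~~~python
import subprocess

subprocess.run(["hdfs", "dfs", "-getmerge", "/data/csvs", "/tmp/merged.csv"],
               check=True)
with open("/tmp/merged.csv") as src, open("/tmp/clean.csv", "w") as dst:
    header = None
    for line in src:
        if header is None:
            header = line       # keep the first header
            dst.write(line)
        elif line != header:    # skip repeated headers from the other files
            dst.write(line)
~~~~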
0
votes
1
answer
58
views
Batch processing of HDFS directories not yet processed
I want to apply batch processing to data available in an HDFS directory.
It works when I change the path manually:
hdfsdir = r"hdfs://VPS-DATA1:9000/directory"
filelist = [ line.rsplit(...
-1
votes
2
answers
407
views
How to Create HDFS file [closed]
I know it is possible to create an HDFS directory with Python using snakebite,
but I am looking to create a file in an HDFS directory.
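snakebite is mostly read/metadata oriented; a common alternative for actually writing a file is the hdfs (WebHDFS) library. A sketch; host, user, and path are assumptions:

~~~~python
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")
# write() used as a context manager creates the file and streams content
with client.write("/user/hadoop/new_file.txt", encoding="utf-8") as writer:
    writer.write("hello hdfs\n")
~~~~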
0
votes
1
answer
666
views
SaveAsTable error when trying to move table to another path using Spark and Hadoop API
So I am trying to move a table from one path to another,
and I get an 'Unclosed character class' error that I can't figure out.
This is my code:
import os
import subprocess
...
0
votes
0
answers
184
views
Ingest data from local with jupyter notebook into hdfs
I'm trying to ingest data from local to HDFS (HDFS and Jupyter notebook are deployed in Kubernetes).
I tried this command in Jupyter, but it doesn't work:
data.to_csv("hdfs:hdfs-namenode-0....
0
votes
1
answer
371
views
Writing to kerberized HDFS using Python | Max retries exceeded with url
I am trying to use Python to write to secure HDFS using the following library (link).
Authentication part:
def init_kinit():
    kinit_args = ['/usr/bin/kinit', '-kt', '/tmp/xx.keytab',
                  '...
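A sketch using the hdfs library's Kerberos extension (requires the requests-kerberos package) after kinit has obtained a ticket; the principal, namenode URL, and paths are assumptions:

~~~~python
import subprocess
from hdfs.ext.kerberos import KerberosClient

# obtain a ticket from the keytab; the principal is a placeholder
subprocess.run(["/usr/bin/kinit", "-kt", "/tmp/xx.keytab", "user@REALM"],
               check=True)
client = KerberosClient("http://namenode:50070")
client.upload("/remote/dir", "local_file.txt")
~~~~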
0
votes
0
answers
445
views
Pandas DataFrame to Impala Table SSL Error
I am trying to connect to the Impala shell so I can create tables from Pandas DataFrames from Cloudera Data Science Workbench,
based on this blog post:
https://netlify--tdhopper.netlify.app/blog/creating-...
1
vote
2
answers
2k
views
How to use pyarrow parquet with multiprocessing
I want to read multiple hdfs files simultaneously using pyarrow and multiprocessing.
The simple python script works (see below), but if I try to do the same thing with multiprocessing, then it hangs ...
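A sketch of the usual fix: open the HDFS connection inside each worker rather than in the parent, so no connection handle crosses process boundaries; host, port, and paths are assumptions:

~~~~python
import multiprocessing as mp
import pyarrow.parquet as pq
from pyarrow import fs

def read_one(path):
    # connect per process instead of inheriting a handle from the parent
    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)
    return pq.read_table(path, filesystem=hdfs).num_rows

if __name__ == "__main__":
    paths = ["/data/a.parquet", "/data/b.parquet"]
    with mp.get_context("spawn").Pool(2) as pool:   # spawn avoids fork issues
        print(pool.map(read_one, paths))
~~~~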
0
votes
2
answers
1k
views
How to connect to hdfs from the docker container?
My goal is to read a file from HDFS in Airflow and do further manipulations.
After researching, I found that the URL I need to use is as follows:
df = pd.read_parquet('http://localhost:9870/webhdfs/v1/...
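A sketch using pandas with fsspec's webhdfs protocol; from inside a container, the host must be the namenode's address reachable on the Docker network (not localhost), and all values below are assumptions:

~~~~python
import pandas as pd

# requires fsspec; storage_options are passed to the webhdfs filesystem
df = pd.read_parquet(
    "webhdfs://namenode:9870/user/hadoop/data.parquet",
    storage_options={"user": "hadoop"},
)
~~~~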
0
votes
0
answers
584
views
Python Hadoop mrjob: subprocess.CalledProcessError: Command returned non-zero exit status 1
I've been using the mrjob package on Python 3.7 recently. I started Hadoop and created a wordaccount.py file, which calculates the frequency of each word in a .txt file. When I tried to run the file through ...
0
votes
0
answers
120
views
How to consume an API using Hive data table column as a key?
I have a Hive data table with Brazilian zip codes in a column, and I need to add the full address in another column.
I found an API with this information (https://pypi.org/project/pycep-correios/) where I ...
1
vote
2
answers
369
views
How to write to HDFS using kedro
I'm trying to write the output of my Kedro pipeline to the HDFS file system, but I couldn't find how to do that on the internet or in the Kedro docs. If anybody has configured HDFS in the Kedro catalog, please share a ...
0
votes
1
answer
486
views
Spark: CopyToLocal in Cluster Mode
I have a PySpark script where data is processed and then converted to CSV files. As the end result should be ONE CSV file accessible via WinSCP, I do some additional processing to put the CSV files on ...
0
votes
0
answers
231
views
How to run mrjob with HDFS on Ubuntu?
I am setting up Hadoop 3.3.1 on Ubuntu. I can run a jar file with HDFS normally (using Eclipse, adding the additional Hadoop jar libs, then exporting), and I can run mrjob locally normally, but when I run mrjob with HDFS, the ...
-1
votes
1
answer
685
views
Error: HTTPConnectionPool(host='dnode2', port=9864): Max retries exceeded with url: /webhdfs
I'm trying to read a file on my HDFS server in my Python app deployed with Docker. During dev I don't have any problem, but in prod there is this error:
Error: HTTPConnectionPool(host='dnode2', ...
0
votes
1
answer
856
views
FileNotFoundError: [Errno 2] No such file or directory when putting a file in HDFS
I use subprocess.Popen in Python to put files in HDFS. It runs correctly using Python in the Windows cmd, but when I use VS Code to run the code, I get "FileNotFoundError: [Errno 2] No such file or ...
0
votes
0
answers
192
views
Copy files from local to HDFS with %20 in the file name in Python
I am trying to copy hundreds of JSON files from local to HDFS using Python subprocess.
subprocess.call(['hdfs', 'dfs', '-put', mylocalpath, hdfspath ])
This works fine if the name is proper with no ...
0
votes
1
answer
80
views
How can we add a condition for date comparison in AWK?
I am running the command below:
hdfs dfs -ls "+hdfsdir" | awk '{$6 == '2022-03-07' ; {print $8}'
$6 contains a date in the format 2022-03-07,
but when I execute this query it gives results ...
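The quoting in the question mixes single quotes inside a single-quoted awk program; a sketch of the same filter done in Python instead, assuming the default `hdfs dfs -ls` layout where field 6 is the date and field 8 the path:

~~~~python
import subprocess

out = subprocess.run(["hdfs", "dfs", "-ls", "/some/dir"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    fields = line.split()
    if len(fields) >= 8 and fields[5] == "2022-03-07":  # $6 in awk terms
        print(fields[7])                                # $8 in awk terms
~~~~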
0
votes
1
answer
722
views
How to export a Hive table into a CSV file when the file contains CJK words that trigger illegal-character warnings
df = sol.spark.sql("SELECT * FROM mytable")
df.write.csv("hdfs:///user/athena_ioc/mydata.csv")
I am using PySpark in this case, so here I am using a Spark dataframe where it cannot ...
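A sketch of one approach: the CSV writer accepts an `encoding` option (supported since Spark 2.4) that can keep CJK characters intact. The table and output path follow the question; the session setup is an assumption:

~~~~python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.sql("SELECT * FROM mytable")
(df.write.option("encoding", "UTF-8")
   .option("header", True)
   .csv("hdfs:///user/athena_ioc/mydata_utf8.csv"))
~~~~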
0
votes
1
answer
199
views
Can pyhdfs make a 'soft' delete?
I am using
from pyhdfs import HdfsClient
fs = HdfsClient(hosts=..., user_name='hdfs', ..)
fs.delete(path_table, recursive=True)
However, after I deleted the directory, I could not find it in the ...
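WebHDFS deletes, as pyhdfs issues them, are permanent; the shell's -rm moves paths into the user's .Trash when fs.trash.interval is enabled, which is the usual 'soft' delete. A sketch; the path is an assumption:

~~~~python
import subprocess

# moves the directory to /user/<name>/.Trash/Current/... instead of purging it
subprocess.run(["hdfs", "dfs", "-rm", "-r", "/path/to/table"], check=True)
~~~~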
1
vote
0
answers
658
views
How to load a SPARK NLP pretrained pipeline through HDFS
I've already installed Spark NLP and its assembly jars, but I still get an error when I try to use one of the models: TypeError: 'JavaPackage' object is not callable.
I cannot install the model ...
1
vote
1
answer
992
views
Running a Python script on executors in a cluster [Scala/Spark]
I have a Python script:
import sys
for line in sys.stdin:
    print("hello " + line)
And I run it on workers in cluster:
def run(spark: SparkSession) = {
  val data = List("john", ...
1
vote
1
answer
39
views
How many types of HDFS Clusters are there and what is the best way to connect to HDFS Cluster using Python
I think the title pretty much sums up my requirement. I would appreciate it if anyone could post how many types of HDFS clusters there are (Kerberos, etc.) and which library is best to ...
0
votes
0
answers
247
views
How to connect to HDFS Cluster using snakebite-py3
I am trying to connect to an HDFS cluster using Python code and the snakebite-py3 library, and I see that when I set use_sasl to True I get the following error:
Code Snippet:
from snakebite.client ...