
All Questions

0 votes
0 answers
25 views

How to correctly output a path containing non-English characters with Popen

This is my function code: ~~~~sql DO $$ from subprocess import PIPE,Popen import os import sys env = os.environ.copy() reload(sys) sys.setdefaultencoding('utf-8') plpy.notice(sys....
mateo • 89
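A minimal sketch of one way to handle this in Python 3 (the non-English path below is hypothetical): keep the path as a str and decode the subprocess output explicitly.
~~~~python
# A minimal sketch (Python 3): pass the non-English path as a str and
# decode Popen's output explicitly; the path is hypothetical.
from subprocess import PIPE, Popen

path = "/tmp/数据/输出.txt"  # hypothetical path with non-English characters
proc = Popen(["ls", "-l", path], stdout=PIPE, stderr=PIPE)
out, err = proc.communicate()
print(out.decode("utf-8"), err.decode("utf-8"))
~~~~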
0 votes
0 answers
27 views

Issues putting/copying a local file to HDFS in Docker

For context, I am working on a solution that collects data with Kafka and then stores it in HDFS. The Kafka producer part is already working, but when it comes to the ...
Assou Iben jellal
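A minimal sketch of one way to do the put step, assuming the Hadoop CLI is available inside a container; the container name and both paths are hypothetical.
~~~~python
# A minimal sketch: run `hdfs dfs -put` inside the Hadoop container.
import subprocess

subprocess.run(
    ["docker", "exec", "namenode",
     "hdfs", "dfs", "-put", "-f", "/tmp/records.csv", "/data/"],
    check=True,
)
~~~~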
1 vote
0 answers
38 views

How to extract the date part from a folder name and move it to another folder on HDFS using PySpark

I currently have folders and sub-folders in a day-wise structure under the path '/dev/data/': 2024.03.30 part-00001.avro part-00002.avro 2024.03.31 part-00001.avro part-00002.avro 2024.04.01 ...
user175025
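A minimal sketch of one approach, driving the Hadoop CLI from Python: parse the dotted date out of each folder name and move it into a date-layout target. The '/archive' target layout is hypothetical.
~~~~python
import re
import subprocess

# List the source directory and capture the output.
listing = subprocess.run(
    ["hdfs", "dfs", "-ls", "/dev/data/"],
    capture_output=True, text=True, check=True,
).stdout

for line in listing.splitlines():
    # Match folder names like /dev/data/2024.03.30 at the end of the line.
    m = re.search(r"(/dev/data/(\d{4})\.(\d{2})\.(\d{2}))\s*$", line)
    if m:
        src, year, month, day = m.groups()
        dst = f"/archive/{year}/{month}/{day}"  # hypothetical target layout
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", dst], check=True)
        subprocess.run(["hdfs", "dfs", "-mv", src, dst], check=True)
~~~~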
1 vote
0 answers
534 views

Spark executes "Listing leaf files and directories" while reading a partitioned dataset on HDFS

I have a dataset on HDFS partitioned on three columns: ym, ymd, and eventName. When I read the dataset using the root path, Spark creates jobs to list all leaf files in the directory. The problem is ...
gohan • 87
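A minimal sketch of one mitigation: point the read at a specific partition with `basePath` set, so Spark keeps the partition columns without recursively listing the whole tree. The paths are hypothetical.
~~~~python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# basePath preserves ym/ymd/eventName as columns while only the
# requested subtree is listed.
df = (spark.read
      .option("basePath", "/data/events")
      .parquet("/data/events/ym=202404/ymd=20240401"))
~~~~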
0 votes
0 answers
60 views

How can I read a pickle file in HDFS?

I developed a deep learning model using PyTorch and saved it as a pickle file. Now I am working on HDFS, where I have a table named XX. I want to use my deep learning model to get a result ...
Ilyes Djerfaf
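A minimal sketch using pyarrow's HDFS filesystem; the host, port, and model path are hypothetical.
~~~~python
import pickle
from pyarrow import fs

# Open the pickle as a byte stream straight from HDFS.
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)
with hdfs.open_input_stream("/models/model.pkl") as f:
    model = pickle.loads(f.read())
~~~~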
0 votes
0 answers
138 views

ClickHouse Server Exception: Code: 210. DB::Exception: Fail to read from HDFS

I'm trying to migrate data from HDFS to ClickHouse. Sometimes the data mart builds without problems, but most often it fails with this error. I tried to wrap it in try/except so that ...
howtoplay112
0 votes
0 answers
64 views

How do I directly read files from HDFS using dask?

I have a dask script that converts a sas7bdat file, using the dask-yarn library to deploy to a YARN cluster and dask_sas_reader for the conversion. dask_sas_reader depends on pyreadstat in ...
aerylias
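A minimal sketch of reading directly from HDFS with dask: hdfs:// URLs are resolved through fsspec (with the pyarrow HDFS driver installed). The host and path are hypothetical.
~~~~python
import dask.dataframe as dd

# dask dispatches the hdfs:// scheme to fsspec's HDFS implementation.
df = dd.read_parquet("hdfs://namenode:8020/data/converted/")
print(df.head())
~~~~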
0 votes
0 answers
43 views

hdfs library will not load in an HDInsight Jupyter notebook

I have a problem in the HDInsight Jupyter notebook: I cannot access outside files. I am trying to access files on the HDInsight cluster head node, which I can ssh to using my username and password from ...
M C • 21
0 votes
0 answers
93 views

PySpark dataframe created using JDBC throws a `year out of range` error when written to a Parquet file in HDFS

I am trying to ingest data from Oracle to a Hive DB (HDFS) using PySpark. The source table has a data row as given: LOAD_TIMESTAMP| OPERATION| COMMIT_TIMESTAMP|KSTOKEY| LPOLNUM|LSYSTYP| XPOLVSN| ...
Shiva • 312
0 votes
0 answers
26 views

Need to send incremental data to HDFS or a Hadoop cluster: how to achieve this?

I'm generating system data like cpu_usage, available memory, total memory, used memory, current time, system name, IP address, etc. for my testing. I'm able to achieve this when trying to send the ...
Raman gupta
0 votes
2 answers
231 views

Accessing HDFS Data with Optimized Locality

I am wondering how to make sure HDFS data access makes the best use of local replicas to minimize network transfer. I am hosting HDFS on 3 machines, and replication is set to 3. ...
David Lee
0 votes
3 answers
543 views

How can I read hdf5 files stored as 1-D arrays and view them as images?

I have a large image classification dataset stored in the .hdf5 format. The dataset has the labels and the images stored in the .hdf5 file. I am unable to view the images as they are stored in the form of ...
Aleph • 205
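A minimal sketch with h5py: read one flattened record and reshape it to view as an image. The dataset name "images" and the 28x28 shape are hypothetical.
~~~~python
import h5py
import matplotlib.pyplot as plt

with h5py.File("dataset.hdf5", "r") as f:
    flat = f["images"][0]      # one image stored as a 1-D array
img = flat.reshape(28, 28)     # restore the 2-D image shape (hypothetical)
plt.imshow(img, cmap="gray")
plt.show()
~~~~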
0 votes
1 answer
579 views

PyArrow: read a file from HDFS on a remote host

I followed the tutorial and guide from the pyarrow docs but I still can't use the HDFS file system correctly to get a file from my remote host. Prerequisite: https://arrow.apache.org/docs/11.0/python/filesystems.html#...
thomas • 31
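A minimal sketch following the pyarrow filesystems doc referenced above; host, port, user, and the file path are hypothetical.
~~~~python
from pyarrow import fs

# Connect to the remote namenode and read one file.
hdfs = fs.HadoopFileSystem(host="remote-host", port=8020, user="hdfs")
with hdfs.open_input_file("/data/example.parquet") as f:
    payload = f.read()
~~~~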
0 votes
3 answers
251 views

Read range of parquet files in PySpark

I have a ton of daily files stored in HDFS, where the partitions are in YYYY-MM-DD format. For example: $ hdfs dfs -ls /my/path/here <some stuff here> /my/path/here/cutoff_date=2023-10-...
Arturo Sbr • 6,293
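A minimal sketch of one way to read a date range: filter on the partition column so Spark prunes to the requested partitions instead of reading everything. The dates are hypothetical.
~~~~python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# The where() on the partition column triggers partition pruning.
df = (spark.read.parquet("/my/path/here")
      .where(F.col("cutoff_date").between("2023-10-01", "2023-10-31")))
~~~~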
0 votes
1 answer
98 views

Can't achieve desired directory structure when writing data from Kafka topic to HDFS using PySpark

I am reading data from a Kafka topic and storing the data to HDFS using PySpark. I am trying to achieve the following directory structure in HDFS when I run the code: user/hadoop/OUTPUT_FINAL1/year/...
user31081998
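A minimal sketch of producing a year/month/day directory layout on write with partitionBy; the tiny input frame and column names are hypothetical.
~~~~python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2024-04-01 10:00:00", "a")],
                           ["timestamp", "value"])
# partitionBy builds .../year=YYYY/month=M/day=D directories on write.
(df.withColumn("year", F.year("timestamp"))
   .withColumn("month", F.month("timestamp"))
   .withColumn("day", F.dayofmonth("timestamp"))
   .write.partitionBy("year", "month", "day")
   .mode("append")
   .parquet("hdfs:///user/hadoop/OUTPUT_FINAL1"))
~~~~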
0 votes
1 answer
411 views

Cannot import Python hdfs client or config modules from hdfscli

I'm trying to create a Python 3 HDFS client using the hdfscli library. Per the documentation, I've tried import hdfs client = hdfs.config.Config().get_client("dev") which yields ...
mojones101
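A minimal sketch of an alternative that sidesteps the config module: instantiate InsecureClient directly. The URL and user are hypothetical.
~~~~python
from hdfs import InsecureClient

# Direct construction avoids hdfs.config entirely.
client = InsecureClient("http://namenode:9870", user="dev")
print(client.list("/"))
~~~~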
0 votes
0 answers
381 views

How can I zip files in an HDFS directory using PySpark or shell?

For example, I want to turn this: unzipped/ ├── file1.csv ├── file2.csv └── file3.csv into this: unzipped.zip ├── file1.csv ├── file2.csv └── file3.csv Using ZipFile as below, I get the ...
nymarya • 21
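A minimal sketch of the shell route: pull the directory to local disk, zip it, and push the archive back, since HDFS has no native zip. All paths are hypothetical.
~~~~python
import shutil
import subprocess

# Copy the HDFS directory to local disk.
subprocess.run(["hdfs", "dfs", "-get", "/data/unzipped", "/tmp/unzipped"],
               check=True)
# Zip it locally, then push the archive back to HDFS.
shutil.make_archive("/tmp/unzipped", "zip", "/tmp/unzipped")
subprocess.run(["hdfs", "dfs", "-put", "-f", "/tmp/unzipped.zip", "/data/"],
               check=True)
~~~~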
0 votes
0 answers
109 views

ConnectionError when connecting my Python code on Windows to HDFS on Linux

I am new to HDFS. I have stored a couple of datasets there. I have running Python code on my local machine, and I want to connect to the HDFS instance on the Linux machine from the code itself. I ...
Daremitsu • 643
0 votes
1 answer
478 views

Connection error when trying to download file from Hadoop Namenode using Python and Docker Compose

I have my API running on a docker container. Since it uses several services, I wrapped them all in a docker-compose.yml: version: "2.12" services: api: container_name: api ...
Bob • 33
0 votes
1 answer
106 views

How to write a Hive partitioned table with PySpark, skipping columns with equal data?

In my project I use Hadoop/Hive with PySpark. My table is created by this query: CREATE TABLE target_db.target_table( id string) PARTITIONED BY ( user_name string, category ...
Redwings • 560
0 votes
1 answer
194 views

How to count the number of files and their size on a cluster?

How to count the number of files and their size on a shared cluster if the files are created by different users? That is, one user created 10 files, the other 20, the size of the first is 2 GB, the ...
howtoplay112
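A minimal sketch of one approach: `hdfs dfs -count` reports directory, file, and byte counts per path, so running it over the per-user home directories gives a per-user breakdown. The glob is hypothetical.
~~~~python
import subprocess

# -count -h prints human-readable sizes; the CLI expands the glob itself.
out = subprocess.run(
    ["hdfs", "dfs", "-count", "-h", "/user/*"],
    capture_output=True, text=True, check=True,
).stdout
print(out)  # columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
~~~~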
0 votes
1 answer
58 views

Changing a node in HDFS

We have code that uploads a config file to HDFS: from hdfs import InsecureClient def upload_file_to_hdfs(local_path, remote_path): client = InsecureClient(url='http://hdfs_server:50070', user='dr....
Павел Иванов
1 vote
1 answer
207 views

spark addJar from hdfs in a jupyter python notebook

We are running a Jupyter notebook connected to an HDFS & Spark cluster. Some users need a jar library for a use case that we don't want to deploy for all notebooks, so we don't want to add this ...
Juh_ • 15.5k
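A minimal sketch of a per-session alternative: pointing spark.jars at an HDFS location loads the jar for that session only, without deploying it to every notebook. The jar path is hypothetical.
~~~~python
from pyspark.sql import SparkSession

# The jar is fetched from HDFS when this session starts.
spark = (SparkSession.builder
         .config("spark.jars", "hdfs:///libs/my-lib.jar")
         .getOrCreate())
~~~~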
0 votes
1 answer
624 views

PySpark 3.3.0 dataframe shows data but writing CSV creates an empty file

I'm facing a very unusual issue. The dataframe shows data when I run df.show(); however, when I try to write it as CSV, the operation completes without error but writes a 0-byte empty file. Is this a bug? Is there ...
StrangerThinks
0 votes
1 answer
334 views

Module/Package resolution in Python

So I have a project directory "dataplatform" and its contents are as follows: ── dataplatform ├── __init__.py ├── commons │   ├── __init__.py │   ├── ...
halfwind22
1 vote
1 answer
241 views

Write multiple Avro files from pyspark to the same directory

I'm trying to write a PySpark dataframe out as Avro files to the HDFS path /my/path/, partitioned by the column 'partition', so under /my/path/ there should be the following sub ...
user3735871
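A minimal sketch, assuming the spark-avro package is on the classpath; the tiny frame and the version coordinates are hypothetical.
~~~~python
from pyspark.sql import SparkSession

# spark-avro is an external package and must be added explicitly.
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "org.apache.spark:spark-avro_2.12:3.3.0")
         .getOrCreate())
df = spark.createDataFrame([(1, "p1"), (2, "p2")], ["id", "partition"])
(df.write.format("avro")
   .partitionBy("partition")   # one sub-directory per partition value
   .mode("append")
   .save("/my/path/"))
~~~~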
0 votes
2 answers
407 views

Merge multiple CSV files in Hadoop into one local CSV file

I have multiple CSV files in a Hadoop folder. Each CSV file has a header, and the header is the same in each file. I am writing these CSV files with the help of ...
Lazy_Nerd
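A minimal sketch of the usual approach: `-getmerge` concatenates the part files into one local file; the repeated headers would still need stripping afterwards. Paths are hypothetical.
~~~~python
import subprocess

# Concatenate every file under /data/csv_dir into one local CSV.
subprocess.run(
    ["hdfs", "dfs", "-getmerge", "/data/csv_dir", "/tmp/merged.csv"],
    check=True,
)
~~~~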
0 votes
1 answer
58 views

Batch processing of new HDFS directories not yet processed

I want to apply batch processing to data available in an HDFS directory. It works when I change the path manually: hdfsdir = r"hdfs://VPS-DATA1:9000/directory" filelist = [ line.rsplit(...
Zak_Stack • 127
-1 votes
2 answers
407 views

How to Create HDFS file [closed]

I know it is possible to create an HDFS directory with Python using snakebite, but I am looking to create a file in an HDFS directory.
Zak_Stack • 127
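A minimal sketch of one alternative: snakebite is read/metadata oriented, but the hdfs (WebHDFS) library can create files. The URL, user, and path are hypothetical.
~~~~python
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hdfs")
# write() creates the file and streams the content to it.
with client.write("/data/new_file.txt", encoding="utf-8") as writer:
    writer.write("hello\n")
~~~~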
0 votes
1 answer
666 views

SaveAsTable error when trying to move table to another path using Spark and Hadoop API

So I am trying to move a table from one path to another, and I get an "Unclosed character class" error that I can't figure out. This is my code: import os import subprocess ...
sasson gabai
0 votes
0 answers
184 views

Ingest data from local into HDFS with a Jupyter notebook

I'm trying to ingest data from local to HDFS (HDFS and the Jupyter notebook are deployed in Kubernetes). I tried this command in Jupyter but it doesn't work: data.to_csv("hdfs:hdfs-namenode-0....
Monica • 99
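A minimal sketch of one alternative to the failing to_csv("hdfs:...") call: the hdfs WebHDFS client accepts a file-like writer that pandas can write into. The service URL, user, and path are hypothetical.
~~~~python
import pandas as pd
from hdfs import InsecureClient

client = InsecureClient("http://hdfs-namenode-0:9870", user="jovyan")
data = pd.DataFrame({"a": [1, 2]})
# Stream the CSV straight into HDFS via the WebHDFS writer.
with client.write("/data/data.csv", encoding="utf-8", overwrite=True) as w:
    data.to_csv(w, index=False)
~~~~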
0 votes
1 answer
371 views

Writing to Kerberized HDFS using Python | Max retries exceeded with url

I am trying to use Python to write to secure HDFS using the following lib (link). Authentication part: def init_kinit(): kinit_args = ['/usr/bin/kinit', '-kt', '/tmp/xx.keytab', '...
Atheer Abdullatif
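A minimal sketch using the hdfs library's Kerberos extension (requests-kerberos must be installed), run after a successful kinit; the URL and paths are hypothetical.
~~~~python
from hdfs.ext.kerberos import KerberosClient

# The client authenticates with the ticket obtained by kinit.
client = KerberosClient("https://namenode:50470")
client.upload("/data/", "/tmp/local_file.txt")
~~~~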
0 votes
0 answers
445 views

Pandas DataFrame to Impala Table SSL Error

I am trying to connect to the Impala shell so I can create tables from pandas DataFrames from Cloudera Data Science Workbench, based on this blog post: https://netlify--tdhopper.netlify.app/blog/creating-...
matt.aurelio
1 vote
2 answers
2k views

How to use pyarrow parquet with multiprocessing

I want to read multiple hdfs files simultaneously using pyarrow and multiprocessing. The simple python script works (see below), but if I try to do the same thing with multiprocessing, then it hangs ...
Sida Zhou • 3,695
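A minimal sketch of the usual fix for this hang: create the HDFS connection inside each worker instead of inheriting it across fork, and use the spawn context. Host, port, and paths are hypothetical.
~~~~python
import multiprocessing as mp

import pyarrow.parquet as pq
from pyarrow import fs

def count_rows(path):
    # Each worker opens its own connection; nothing is shared with the
    # parent process.
    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)
    return pq.read_table(path, filesystem=hdfs).num_rows

if __name__ == "__main__":
    paths = ["/data/a.parquet", "/data/b.parquet"]
    with mp.get_context("spawn").Pool(2) as pool:
        print(pool.map(count_rows, paths))
~~~~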
0 votes
2 answers
1k views

How to connect to hdfs from the docker container?

My goal is to read a file from HDFS in Airflow and do further manipulations. After researching, I found that the URL I need to use is as follows: df = pd.read_parquet('http://localhost:9870/webhdfs/v1/...
Val • 21
0 votes
0 answers
584 views

Python Hadoop mrjob: subprocess.CalledProcessError: Command returned non-zero exit status 1

I've recently been using the mrjob package on Python 3.7. I started Hadoop and created a wordaccount.py file, which calculates the frequency of each word in a .txt file. When I tried to run the file through ...
yamato • 235
0 votes
0 answers
120 views

How to consume an API using Hive data table column as a key?

I have a Hive table with Brazilian zip codes in a column, and I need to add the full address in another column. I found an API with this information (https://pypi.org/project/pycep-correios/) where I ...
Marcus • 1
1 vote
2 answers
369 views

How to write to HDFS using kedro

I'm trying to write the output of my Kedro pipeline to the HDFS file system, but I couldn't find how to do that on the internet or in the Kedro docs. If anybody has configured HDFS in the Kedro catalog, please share a ...
Noushad k
0 votes
1 answer
486 views

Spark: CopyToLocal in Cluster Mode

I have a PySpark Script where data is processed and then converted to CSV files. As the end result should be ONE CSV file accessible via WinSCP, I do some additional processing to put the CSV files on ...
Niels • 153
0 votes
0 answers
231 views

How to run mrjob with HDFS on Ubuntu?

I am setting up Hadoop 3.3.1 on Ubuntu. I can run a jar file with HDFS normally (using Eclipse, adding the additional Hadoop jar libs, then exporting), and I can run mrjob locally normally, but while running mrjob with HDFS the ...
robocon20x
-1 votes
1 answer
685 views

Error: HTTPConnectionPool(host='dnode2', port=9864): Max retries exceeded with url: /webhdfs

I'm trying to read a file on my HDFS server in my Python app deployed with Docker. During dev I don't have any problem, but in prod there is this error: Error: HTTPConnectionPool(host='dnode2', ...
Eboua Osée
0 votes
1 answer
856 views

FileNotFoundError: [Errno 2] No such file or directory when putting a file in HDFS

I use subprocess.Popen in Python to put files in HDFS. It runs accurately using Python from the Windows cmd, but when I use VS Code to run the code, I get "FileNotFoundError: [Errno 2] No such file or ...
Muhammadoufi
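A minimal sketch of one common cause and fix: give Popen an absolute path to the hdfs launcher so it resolves even when the IDE's PATH differs from cmd's. Both paths are hypothetical.
~~~~python
from subprocess import PIPE, Popen

# Absolute path to the launcher avoids PATH lookup differences.
HDFS = r"C:\hadoop\bin\hdfs.cmd"
proc = Popen([HDFS, "dfs", "-put", r"C:\tmp\file.txt", "/data/"],
             stdout=PIPE, stderr=PIPE)
print(proc.communicate())
~~~~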
0 votes
0 answers
192 views

Copy a file from local to HDFS with %20 in the file name in Python

I am trying to copy hundreds of JSON files from local to HDFS using the Python subprocess module: subprocess.call(['hdfs', 'dfs', '-put', mylocalpath, hdfspath]) This works fine if the name is proper, with no ...
Padfoot123 • 1,117
0 votes
1 answer
80 views

How can we add a condition for date comparison in AWK?

I am giving the command below: hdfs dfs -ls "+hdfsdir" | awk '{$6 == '2022-03-07' ; {print $8}' Field $6 contains the date in the format 2022-03-07, but when I execute this query it gives results ...
Anshuman Madhav
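A minimal sketch of a working form of that command: awk compares ISO-formatted dates correctly as strings, and the condition and action belong in one pattern-action block. hdfsdir is hypothetical here.
~~~~python
import subprocess

hdfsdir = "/data"
# $6 >= "2022-03-07" is a plain string comparison, which is correct for
# ISO dates; {print $8} runs only for matching lines.
cmd = ("hdfs dfs -ls " + hdfsdir +
       " | awk '$6 >= \"2022-03-07\" {print $8}'")
print(subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout)
~~~~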
0 votes
1 answer
722 views

How to export a Hive table into a CSV file when the file contains CJK words that trigger illegal-character warnings

df = sol.spark.sql("SELECT * FROM mytable") df.write.csv("hdfs:///user/athena_ioc/mydata.csv") I am using PySpark in this case, so here I am using a Spark dataframe, which cannot ...
ziyang liu
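A minimal sketch of one thing to try: set an explicit encoding on the CSV writer so CJK text round-trips. The table name and output path come from the question itself.
~~~~python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT * FROM mytable")
# The encoding option controls the charset of the written CSV files.
(df.write.option("encoding", "UTF-8")
   .option("header", True)
   .csv("hdfs:///user/athena_ioc/mydata.csv"))
~~~~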
0 votes
1 answer
199 views

Can pyhdfs make a 'soft' delete?

I am using: from pyhdfs import HdfsClient fs = HdfsClient(hosts=..., user_name='hdfs', ..) fs.delete(path_table, recursive=True) However, after I deleted the directory, I could not find it in the ...
user2894829
1 vote
0 answers
658 views

How to load a Spark NLP pretrained pipeline from HDFS

I've already installed sparknlp and its assembly jars, but I still get an error when I try to use one of the models: TypeError: 'JavaPackage' object is not callable. I cannot install the model ...
LucasA • 11
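A minimal sketch of one route for the loading part: a saved Spark NLP pipeline is a regular Spark ML PipelineModel, so it can be loaded from an HDFS path once the spark-nlp jar is on the session's classpath. The model path is hypothetical.
~~~~python
from pyspark.ml import PipelineModel

# Loads the saved pipeline directly from HDFS.
pipeline = PipelineModel.load("hdfs:///models/explain_document_dl")
~~~~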
1 vote
1 answer
992 views

Running a Python script on executors in a cluster [Scala/Spark]

I have a Python script: import sys for line in sys.stdin: print("hello " + line) And I run it on workers in the cluster: def run(spark: SparkSession) = { val data = List("john", ...
Mardaunt
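A minimal sketch of the same idea in PySpark: ship the script with addFile, then stream each partition through it with pipe. The script path is hypothetical, and in local mode the resolved path is valid on the executors too.
~~~~python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
sc.addFile("/home/me/test.py")  # distribute the script to the workers
names = sc.parallelize(["john", "mary"])
# pipe() feeds each partition to the script's stdin, one element per line.
print(names.pipe("python3 " + SparkFiles.get("test.py")).collect())
~~~~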
1 vote
1 answer
39 views

How many types of HDFS Clusters are there and what is the best way to connect to HDFS Cluster using Python

I think the title pretty much sums up my requirement. I would appreciate it if anyone could post how many types of HDFS clusters there are (Kerberos, etc.) and also which is the best library used to ...
nikhil int
0 votes
0 answers
247 views

How to connect to HDFS Cluster using snakebite-py3

I am trying to connect to an HDFS cluster using Python code and the snakebite-py3 library, and I see that when I set use_sasl to True I get the following error. Code snippet: from snakebite.client ...
nikhil int
