All Questions
372 questions
0
votes
0
answers
25
views
How to correctly output a path containing non-English characters with Popen
This is my function code:
~~~~sql
DO
$$
from subprocess import PIPE,Popen
import os
import sys
env = os.environ.copy()
reload(sys)
sys.setdefaultencoding('utf-8')
plpy.notice(sys....
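~~~~

The excerpt uses Python 2 idioms (reload(sys), sys.setdefaultencoding) that don't exist in Python 3. A minimal Python 3 sketch of the usual fix, passing the path as a str and setting encoding on Popen so the pipes are decoded; the path and command below are hypothetical:

~~~~python
from subprocess import PIPE, Popen

path = "/данные/テスト"  # hypothetical non-English path; a str in Python 3
# encoding="utf-8" (Python 3.6+) makes stdout/stderr text streams
proc = Popen(["ls", "-ld", path], stdout=PIPE, stderr=PIPE, encoding="utf-8")
out, err = proc.communicate()
print(out)
~~~~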
0
votes
0
answers
27
views
Issues putting/copying a local file to HDFS in Docker
To give some context: I am working on a solution that requires collecting data with Kafka and then storing that data in HDFS. The Kafka producer part is already working, but when it comes to the ...
1
vote
0
answers
38
views
How to extract date part from folder name and move it to another folder on hdfs using pyspark
I currently have folders and sub-folders in a day-wise structure under the path '/dev/data/':
2024.03.30
    part-00001.avro
    part-00002.avro
2024.03.31
    part-00001.avro
    part-00002.avro
2024.04.01
...
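A sketch of one approach driving the HDFS shell from Python; the source path matches the question, while the target path and layout are assumptions:

~~~~python
import re
import subprocess

SRC = "/dev/data"
DST = "/dev/archive"  # hypothetical target directory

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", DST], check=True)
ls = subprocess.run(["hdfs", "dfs", "-ls", SRC],
                    capture_output=True, text=True, check=True)
for line in ls.stdout.splitlines():
    fields = line.split()
    if not fields:
        continue
    path = fields[-1]                 # last field of `-ls` output is the path
    name = path.rsplit("/", 1)[-1]
    if re.fullmatch(r"\d{4}\.\d{2}\.\d{2}", name):   # e.g. 2024.03.30
        # move the whole dated folder under the target directory
        subprocess.run(["hdfs", "dfs", "-mv", path, f"{DST}/{name}"], check=True)
~~~~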
1
vote
0
answers
534
views
Spark executes "Listing leaf files and directories" while reading a partitioned dataset on HDFS
I have a dataset on HDFS, and it's partitioned on three columns: ym, ymd, and eventName. When I read the dataset using the root path, Spark creates jobs to list all leaf files in the directory. The problem is ...
0
votes
0
answers
60
views
How can I read a pickle file in HDFS?
I have developed a deep learning model using PyTorch and saved it as a pickle file.
Now, I am working on HDFS and I have a table named XX. I want to use my deep learning model to get a result ...
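A minimal sketch using the hdfs library's WebHDFS client; the namenode URL, user, and model path are assumptions:

~~~~python
import pickle
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")
# read() is a context manager yielding a file-like object
with client.read("/models/model.pkl") as reader:
    model = pickle.loads(reader.read())
~~~~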
0
votes
0
answers
138
views
ClickHouse Server Exception: Code: 210.DB::Exception: Fail to read from HDFS:
I'm trying to migrate data from HDFS to ClickHouse. Sometimes the showcase is assembled without problems, but most often it fails with this error. I tried to wrap it in try/except so that ...
0
votes
0
answers
64
views
How do I directly read files from HDFS using dask?
I have a Dask script that converts a sas7bdat file, using dask-yarn to deploy to a YARN cluster and dask_sas_reader for the conversion. dask_sas_reader depends on pyreadstat in ...
0
votes
0
answers
43
views
hdfs library will not load in an HDInsight Jupyter notebook
I have a problem in the HDInsight Jupyter Notebook.
I cannot access outside files. I am trying to access files on the HDInsight cluster head node which I can ssh to using my username and password from ...
0
votes
0
answers
93
views
PySpark dataframe created using JDBC throws a `year out of range` error when written to a Parquet file in HDFS
I am trying to do data ingestion from Oracle to Hive DB (HDFS) using PySpark.
The source table has a data row like this:
LOAD_TIMESTAMP| OPERATION| COMMIT_TIMESTAMP|KSTOKEY| LPOLNUM|LSYSTYP| XPOLVSN| ...
0
votes
0
answers
26
views
Need to send incremental data to HDFS or a Hadoop cluster; how can I achieve this?
I'm generating system data like cpu_usage, available memory, total memory, used memory, current time, system name, IP address, etc. for my testing. I'm able to achieve this when trying to send the ...
0
votes
2
answers
231
views
Accessing HDFS Data with Optimized Locality
I am wondering how to make sure HDFS data access makes the best use of local replicas to minimize network transfer.
I am hosting HDFS on 3 machines and replication is set to 3. ...
0
votes
3
answers
543
views
How can I read HDF5 files stored as 1-D arrays and view them as images?
I have a large image classification dataset stored in the .hdf5 format. The dataset has the labels and the images stored in the .hdf5 file. I am unable to view the images as they are stored in the form of ...
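A sketch assuming the HDF5 file has an 'images' dataset of flattened rows with known image dimensions; the file name, dataset name, and shape are assumptions:

~~~~python
import h5py
import matplotlib.pyplot as plt

with h5py.File("dataset.hdf5", "r") as f:
    flat = f["images"][0]              # one image as a 1-D array
    img = flat.reshape(32, 32, 3)      # restore (H, W, C); adjust to your data
plt.imshow(img)
plt.show()
~~~~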
0
votes
1
answer
579
views
Pyarrow read remote host hdfs file
I followed the tutorial and guide from the PyArrow docs, but I still can't correctly use the HDFS file system to get a file from my remote host.
Prerequisite: https://arrow.apache.org/docs/11.0/python/filesystems.html#...
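A minimal sketch with the pyarrow.fs API; host, port, user, and the file path are assumptions, and a local libhdfs / Hadoop client setup is required:

~~~~python
from pyarrow import fs
import pyarrow.parquet as pq

hdfs = fs.HadoopFileSystem(host="namenode", port=8020, user="hadoop")
table = pq.read_table("/data/file.parquet", filesystem=hdfs)
~~~~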
0
votes
3
answers
251
views
Read range of parquet files in PySpark
I have a ton of daily files stored in an HDFS where the partitions are stored in YYYY-MM-DD format.
For example:
$ hdfs dfs -ls /my/path/here
<some stuff here> /my/path/here/cutoff_date=2023-10-...
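One approach is to read the base path and filter on the partition column, relying on partition pruning so only the matching directories are scanned. A sketch; the date bounds are assumptions modeled on the question's layout:

~~~~python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# the cutoff_date=YYYY-MM-DD directories become a partition column
df = (spark.read.parquet("/my/path/here")
      .where(F.col("cutoff_date").between("2023-10-01", "2023-10-31")))
~~~~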
0
votes
1
answer
98
views
Can't achieve desired directory structure when writing data from Kafka topic to HDFS using PySpark
I am reading data from a Kafka topic and storing the data to HDFS using PySpark. I am trying to achieve the following directory structure in HDFS when I run the code:
user/hadoop/OUTPUT_FINAL1/year/...
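A minimal batch-write sketch: with partitionBy, Spark creates year=/month=/day= subdirectories under the output path. The column names and sample data are assumptions:

~~~~python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = (spark.createDataFrame([("a", "2024-01-15 10:00:00")], ["value", "ts"])
      .withColumn("ts", F.to_timestamp("ts")))
out = (df.withColumn("year", F.year("ts"))
         .withColumn("month", F.month("ts"))
         .withColumn("day", F.dayofmonth("ts")))
(out.write.mode("append")
    .partitionBy("year", "month", "day")
    .parquet("hdfs:///user/hadoop/OUTPUT_FINAL1"))
~~~~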
0
votes
1
answer
411
views
Cannot import Python hdfs client or config modules from hdfscli
I'm trying to create a Python 3 HDFS client using the hdfscli library. Per the documentation, I've tried
import hdfs
client = hdfs.config.Config().get_client("dev")
which yields ...
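Config is also exposed at the package top level; a minimal sketch assuming a 'dev' alias is configured in ~/.hdfscli.cfg:

~~~~python
from hdfs import Config

client = Config().get_client("dev")
print(client.list("/"))
~~~~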
0
votes
0
answers
381
views
How can I zip files in a HDFS directory using PySpark or shell?
For example, I want to turn this
unzipped/
├── file1.csv
├── file2.csv
└── file3.csv
into this
unzipped.zip
├── file1.csv
├── file2.csv
└── file3.csv
Using ZipFile as below I get the ...
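HDFS has no native zip operation, so a common workaround is to copy the directory down, zip it locally, and put the archive back. A sketch; all paths are assumptions:

~~~~python
import subprocess
import zipfile
from pathlib import Path

subprocess.run(["hdfs", "dfs", "-get", "/data/unzipped", "/tmp/unzipped"], check=True)
with zipfile.ZipFile("/tmp/unzipped.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for p in Path("/tmp/unzipped").glob("*.csv"):
        zf.write(p, arcname=p.name)   # store file1.csv etc. at the archive root
subprocess.run(["hdfs", "dfs", "-put", "-f", "/tmp/unzipped.zip", "/data/"], check=True)
~~~~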
0
votes
0
answers
109
views
ConnectionError when connecting my Python code on Windows to my HDFS on Linux
I am new to using HDFS. I have stored a couple of datasets there. I have working Python code on my local machine, and I want to connect to the HDFS instance on the Linux machine from the code itself.
I ...
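A minimal sketch connecting from Windows to a remote WebHDFS endpoint; the hostname, port (9870 on Hadoop 3.x, 50070 on 2.x), and user are assumptions:

~~~~python
from hdfs import InsecureClient

client = InsecureClient("http://linux-host:9870", user="hadoop")
print(client.list("/datasets"))
~~~~

Note that WebHDFS redirects reads and writes to datanode hostnames, so those hostnames must also be resolvable from the Windows machine.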
0
votes
1
answer
478
views
Connection error when trying to download file from Hadoop Namenode using Python and Docker Compose
I have my API running on a docker container. Since it uses several services, I wrapped them all in a docker-compose.yml:
version: "2.12"
services:
api:
container_name: api
...
0
votes
1
answer
106
views
How to write a Hive partitioned table with PySpark, skipping equal data columns?
In my project I use Hadoop/Hive with PySpark.
My table was created by this query:
CREATE TABLE target_db.target_table(
    id string)
PARTITIONED BY (
    user_name string,
    category ...
0
votes
1
answer
194
views
How to count the number of files and their size on a cluster?
How to count the number of files and their size on a common cluster if the files are created by different users? That is, one user created 10 files, the other 20, the size of the first is 2 GB, the ...
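A sketch using `hdfs dfs -count`, whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME; counting per user home directory is an assumption:

~~~~python
import subprocess

out = subprocess.run(["hdfs", "dfs", "-count", "/user/*"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    dirs, files, size, path = line.split(None, 3)
    print(f"{path}: {files} files, {int(size) / 2**30:.2f} GiB")
~~~~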
0
votes
1
answer
58
views
Changing a node in hdfs
We have this code uploading a config file to HDFS:
from hdfs import InsecureClient
def upload_file_to_hdfs(local_path, remote_path):
    client = InsecureClient(url='http://hdfs_server:50070', user='dr....
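A completed version of the helper as a sketch; the URL is from the question, while the user (truncated above) is left as a placeholder:

~~~~python
from hdfs import InsecureClient

def upload_file_to_hdfs(local_path, remote_path):
    client = InsecureClient(url="http://hdfs_server:50070", user="...")  # user truncated in the question
    # upload() copies the local file to the remote path, replacing any
    # existing file when overwrite=True
    client.upload(remote_path, local_path, overwrite=True)
~~~~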
1
vote
1
answer
207
views
spark addJar from hdfs in a jupyter python notebook
We are running a Jupyter notebook connected to an HDFS & Spark cluster. Some users need a jar library for a use case that we don't want to deploy for all notebooks. So we don't want to add this ...
0
votes
1
answer
624
views
PySpark 3.3.0 dataframe shows data but writing CSV creates an empty file
Facing a very unusual issue. The dataframe shows data when I run df.show(); however, when trying to write it as CSV, the operation completes without error but writes a 0-byte empty file.
Is this a bug? Is there ...
0
votes
1
answer
334
views
Module/Package resolution in Python
So I have a project directory "dataplatform", and its contents are as follows:
── dataplatform
├── __init__.py
├── commons
│ ├── __init__.py
│ ├── ...
1
vote
1
answer
241
views
Write multiple Avro files from pyspark to the same directory
I'm trying to write a PySpark dataframe out as Avro files to the path /my/path/ on HDFS, partitioned by the column 'partition', so under /my/path/ there should be the following sub ...
0
votes
2
answers
407
views
Merge multiple CSV files present in Hadoop into one CSV file locally
I have multiple CSV files in a Hadoop folder. Each CSV file has a header, and the header is the same in each file.
I am writing these csv files with the help of ...
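A sketch using `hdfs dfs -getmerge` and then dropping the duplicated headers locally; the paths are assumptions, and the filter assumes no data row is identical to the header line:

~~~~python
import subprocess

subprocess.run(["hdfs", "dfs", "-getmerge", "/data/csvs", "/tmp/merged.csv"],
               check=True)
with open("/tmp/merged.csv") as src, open("/tmp/clean.csv", "w") as dst:
    header = None
    for line in src:
        if header is None:
            header = line       # keep the first header
            dst.write(line)
        elif line != header:    # skip repeated headers from the other files
            dst.write(line)
~~~~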
0
votes
1
answer
58
views
Batch processing of HDFS directories not yet processed
I want to apply batch processing to data available in an HDFS directory.
It works when I change the path manually:
hdfsdir = r"hdfs://VPS-DATA1:9000/directory"
filelist = [ line.rsplit(...
-1
votes
2
answers
407
views
How to Create HDFS file [closed]
I know it is possible to create an HDFS directory with Python using snakebite,
but I am looking to create a file in an HDFS directory.
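snakebite is mostly read/metadata oriented; a common alternative for actually writing a file is the hdfs (WebHDFS) library. A sketch; host, user, and path are assumptions:

~~~~python
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")
# write() used as a context manager creates the file and streams content
with client.write("/user/hadoop/new_file.txt", encoding="utf-8") as writer:
    writer.write("hello hdfs\n")
~~~~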
0
votes
1
answer
666
views
SaveAsTable error when trying to move table to another path using Spark and Hadoop API
So I am trying to move a table from one path to another,
and I get an 'Unclosed character class' error that I can't figure out.
This is my code:
import os
import subprocess
...
0
votes
0
answers
184
views
Ingest data from local with jupyter notebook into hdfs
I'm trying to ingest data from local to HDFS (HDFS and Jupyter notebook are deployed in Kubernetes).
I tried this command in Jupyter, but it doesn't work:
data.to_csv("hdfs:hdfs-namenode-0....
0
votes
1
answer
371
views
Writing to kerberized HDFS using Python | Max retries exceeded with url
I am trying to use Python to write to secure HDFS using the following library (link).
Authentication part:
def init_kinit():
    kinit_args = ['/usr/bin/kinit', '-kt', '/tmp/xx.keytab',
                  '...
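A sketch using the hdfs library's Kerberos extension (requires the requests-kerberos package) after kinit has obtained a ticket; the principal, namenode URL, and paths are assumptions:

~~~~python
import subprocess
from hdfs.ext.kerberos import KerberosClient

# obtain a ticket from the keytab; the principal is a placeholder
subprocess.run(["/usr/bin/kinit", "-kt", "/tmp/xx.keytab", "user@REALM"],
               check=True)
client = KerberosClient("http://namenode:50070")
client.upload("/remote/dir", "local_file.txt")
~~~~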
0
votes
0
answers
445
views
Pandas DataFrame to Impala Table SSL Error
I am trying to connect to the Impala shell so I can create tables from Pandas DataFrames from Cloudera Data Science Workbench,
based on this blog post:
https://netlify--tdhopper.netlify.app/blog/creating-...
1
vote
2
answers
2k
views
How to use pyarrow parquet with multiprocessing
I want to read multiple hdfs files simultaneously using pyarrow and multiprocessing.
The simple python script works (see below), but if I try to do the same thing with multiprocessing, then it hangs ...
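A sketch of the usual fix: open the HDFS connection inside each worker rather than in the parent, so no connection handle crosses process boundaries; host, port, and paths are assumptions:

~~~~python
import multiprocessing as mp
import pyarrow.parquet as pq
from pyarrow import fs

def read_one(path):
    # connect per process instead of inheriting a handle from the parent
    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)
    return pq.read_table(path, filesystem=hdfs).num_rows

if __name__ == "__main__":
    paths = ["/data/a.parquet", "/data/b.parquet"]
    with mp.get_context("spawn").Pool(2) as pool:   # spawn avoids fork issues
        print(pool.map(read_one, paths))
~~~~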
0
votes
2
answers
1k
views
How to connect to hdfs from the docker container?
My goal is to read a file from HDFS in Airflow and do further manipulations.
After researching, I found that the URL I need to use is as follows:
df = pd.read_parquet('http://localhost:9870/webhdfs/v1/...
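A sketch using pandas with fsspec's webhdfs protocol; from inside a container, the host must be the namenode's address reachable on the Docker network (not localhost), and all values below are assumptions:

~~~~python
import pandas as pd

# requires fsspec; storage_options are passed to the webhdfs filesystem
df = pd.read_parquet(
    "webhdfs://namenode:9870/user/hadoop/data.parquet",
    storage_options={"user": "hadoop"},
)
~~~~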
0
votes
0
answers
584
views
Python Hadoop mrjob: subprocess.CalledProcessError: Command returned non-zero exit status 1
I've been using the mrjob package on Python 3.7 recently. I started Hadoop and created a wordaccount.py file, which calculates the frequency of each word in a .txt file. When I tried to run the file through ...
0
votes
0
answers
120
views
How to consume an API using Hive data table column as a key?
I have a Hive data table with Brazilian zip codes in a column, and I need to add the full address in another column.
I found an API with this information (https://pypi.org/project/pycep-correios/) where I ...
1
vote
2
answers
369
views
How to write to HDFS using kedro
I'm trying to write the output of my Kedro pipeline to the HDFS file system, but I couldn't find how to do that on the internet or in the Kedro docs. If anybody has configured HDFS in the Kedro catalog, please share a ...
0
votes
1
answer
486
views
Spark: CopyToLocal in Cluster Mode
I have a PySpark script where data is processed and then converted to CSV files. As the end result should be ONE CSV file accessible via WinSCP, I do some additional processing to put the CSV files on ...
0
votes
0
answers
231
views
How to run mrjob with HDFS on Ubuntu?
I am setting up Hadoop 3.3.1 on Ubuntu. I can run a jar file with HDFS normally (using Eclipse, adding the additional Hadoop jar libs, then exporting), and I can run mrjob locally normally, but when I run mrjob with HDFS, the ...
-1
votes
1
answer
685
views
Error: HTTPConnectionPool(host='dnode2', port=9864): Max retries exceeded with url: /webhdfs
I'm trying to read a file on my HDFS server in my Python app deployed with Docker. During dev I don't have any problem, but in prod there is this error:
Error: HTTPConnectionPool(host='dnode2', ...
0
votes
1
answer
856
views
FileNotFoundError: [Errno 2] No such file or directory when putting a file in HDFS
I use subprocess.Popen in Python to put files in HDFS. It runs correctly using Python in the Windows cmd, but when I use VS Code to run the code, I get "FileNotFoundError: [Errno 2] No such file or ...
0
votes
0
answers
192
views
Copy files from local to HDFS with %20 in the file name in Python
I am trying to copy hundreds of JSON files from local to HDFS using Python subprocess.
subprocess.call(['hdfs', 'dfs', '-put', mylocalpath, hdfspath ])
This works fine if the name is proper with no ...
0
votes
1
answer
80
views
How can we add a condition for date comparison in AWK?
I am running the command below:
hdfs dfs -ls "+hdfsdir" | awk '{$6 == '2022-03-07' ; {print $8}'
$6 contains a date in the format 2022-03-07,
but when I execute this query it gives results ...
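The quoting in the question mixes single quotes inside a single-quoted awk program; a sketch of the same filter done in Python instead, assuming the default `hdfs dfs -ls` layout where field 6 is the date and field 8 the path:

~~~~python
import subprocess

out = subprocess.run(["hdfs", "dfs", "-ls", "/some/dir"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    fields = line.split()
    if len(fields) >= 8 and fields[5] == "2022-03-07":  # $6 in awk terms
        print(fields[7])                                # $8 in awk terms
~~~~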
0
votes
1
answer
722
views
How to export a Hive table into a CSV file when the file contains CJK words that trigger illegal-character warnings
df = sol.spark.sql("SELECT * FROM mytable")
df.write.csv("hdfs:///user/athena_ioc/mydata.csv")
I am using PySpark in this case, so here I am using a Spark dataframe where it cannot ...
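A sketch of one approach: the CSV writer accepts an `encoding` option (supported since Spark 2.4) that can keep CJK characters intact. The table and output path follow the question; the session setup is an assumption:

~~~~python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.sql("SELECT * FROM mytable")
(df.write.option("encoding", "UTF-8")
   .option("header", True)
   .csv("hdfs:///user/athena_ioc/mydata_utf8.csv"))
~~~~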
0
votes
1
answer
199
views
Can pyhdfs make a 'soft' delete?
I am using
from pyhdfs import HdfsClient
fs = HdfsClient(hosts=..., user_name='hdfs', ..)
fs.delete(path_table, recursive=True)
However, after I deleted the directory, I could not find it in the ...
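WebHDFS deletes, as pyhdfs issues them, are permanent; the shell's -rm moves paths into the user's .Trash when fs.trash.interval is enabled, which is the usual 'soft' delete. A sketch; the path is an assumption:

~~~~python
import subprocess

# moves the directory to /user/<name>/.Trash/Current/... instead of purging it
subprocess.run(["hdfs", "dfs", "-rm", "-r", "/path/to/table"], check=True)
~~~~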
1
vote
0
answers
658
views
How to load a SPARK NLP pretrained pipeline through HDFS
I've already installed Spark NLP and its assembly jars, but I still get an error when I try to use one of the models: TypeError: 'JavaPackage' object is not callable.
I cannot install the model ...
1
vote
1
answer
992
views
Running a Python script on executors in a cluster [Scala/Spark]
I have a Python script:
import sys
for line in sys.stdin:
    print("hello " + line)
And I run it on workers in cluster:
def run(spark: SparkSession) = {
  val data = List("john", ...
1
vote
1
answer
39
views
How many types of HDFS Clusters are there and what is the best way to connect to HDFS Cluster using Python
I think the title pretty much sums up my requirement. I would appreciate it if anyone could post how many types of HDFS clusters there are (Kerberos, etc.) and which library is best to ...
0
votes
0
answers
247
views
How to connect to HDFS Cluster using snakebite-py3
I am trying to connect to an HDFS cluster using Python code and the snakebite-py3 library, and I see that when I set use_sasl to True I get the following error:
Code Snippet:
from snakebite.client ...