185 questions
0 votes · 1 answer · 56 views
UnsupportedFileSystemException after spring boot starter parent version upgrade to 3.2 or 3.3
We have a Spring Boot application service which calls the Hadoop distcp API.
The application works fine with Java 17, spring-boot-starter-parent 3.1.0 and Hadoop 3.3.6.
When we changed spring-boot-...
0 votes · 0 answers · 36 views
Redirecting distcp output to log4j from stdout
I am currently running distcp using org.apache.hadoop.util.ToolRunner.run and the output is being written to the console.
DistCpOptions opts = getDistCpOptions(srcPath, dstPath);
private ...
0 votes · 0 answers · 55 views
distcp throws java.io.IOException when copying files
I'm trying to copy 20TB of HDFS files from one cluster to another, using this single distcp command:
hadoop distcp
-Dmapreduce.map.memory.mb=16834
-Dmapreduce.reduce.memory.mb=32768
-Dyarn.app....
0 votes · 0 answers · 33 views
How to copy data from an old Hive cluster to a new Hive cluster and keep all Hive metastore meta data?
I have two different Apache Hive clusters and I need to transfer data from the old to the new. What is the best way to accomplish this? The old cluster is using Hive 1.2 with Hadoop 2.7 and the new is ...
0 votes · 0 answers · 69 views
s3-dist-cp groupby Regex Capture
I'm using EMR to combine hundreds of thousands of very small (1-5 row) CSV files. I want to concatenate them into roughly 100 MB files so they are easier to work with.
My EMR job uses command-runner.jar ...
0 votes · 0 answers · 288 views
Fastest way to copy large data from HDFS location to GCP bucket using command
I have 5 TB of data which needs to be transferred to a GCP bucket using some command.
I tried using hadoop distcp -m num -strategy dynamic source_path destination_path. It has been executing for a long time.
...
0 votes · 1 answer · 181 views
GCS Connector on EMR failing with java.lang.ClassNotFoundException
I have created an EMR cluster with the instructions on how to create a connection from GCS provided here, and keep running the hadoop distcp command.
It keeps failing with the following error:
2023-07-...
0 votes · 0 answers · 267 views
How to specify a filter to exclude hdfs file of a partition when calling distcp -update?
Hadoop distcp -update is used with a filter. I want to exclude the HDFS files of the partition dt=20230621. What should I do? The command I am using now is
$ hadoop distcp -update -append -filters &...
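For questions like this one, distcp's -filters option takes a local file containing one Java regex per line; any source path matching a regex is excluded from the copy. A minimal sketch, assuming the partition name from the question (the filter-file location and the source/target paths are placeholders):

```shell
# Create a filter file: one Java regex per line; matching paths are skipped.
echo '.*/dt=20230621/.*' > /tmp/exclusion-filters.txt

# -filters points at that local file; files under dt=20230621 are excluded.
hadoop distcp -update -filters /tmp/exclusion-filters.txt \
    hdfs://source/table hdfs://target/table
```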
0 votes · 0 answers · 503 views
requested yarn user xxx not found when running distcp from hadoop to GCS
When I try to run the distcp command
hadoop distcp /user/a.txt gs://user/a.txt
i get the message
ERROR tools.DistCp: Exception encountered
main : run as user is xxx
main : requested yarn user is ...
1 vote · 0 answers · 419 views
Hadoop distcp does not skip CRC checks
I have an issue with skipping CRC checks between source and target paths running distcp.
I copy and decrypt files on demand and their checksum is different, that is expected.
My command looks like ...
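One detail worth checking in situations like this: -skipcrccheck is honored only in combination with -update; on its own, distcp still performs the checksum comparison. A hedged sketch (paths are placeholders):

```shell
# -skipcrccheck takes effect only together with -update;
# without -update, distcp still compares source/target checksums.
hadoop distcp -update -skipcrccheck hdfs://src/encrypted hdfs://dst/decrypted
```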
0 votes · 0 answers · 227 views
Hadoop distcp to S3
I am using the Hadoop distcp command to move data from HDFS to S3. Recently, after the Hadoop CDH to CDP upgrade, I am facing a difference in the -update option. Previously -update would move files with same file ...
0 votes · 1 answer · 129 views
distcp one table to another table with different name
Note: 'table' here is the name of a Hive table, which is an HDFS directory.
I have two servers, C1 and C2. C1 has a table item.name in sequential format.
C2 has a table item.name in ORC format, which has same ...
0 votes · 2 answers · 890 views
distcp - copy data from cloudera hdfs to cloud storage
I am trying to replicate data between HDFS and my GCP Cloud Storage. This is not a one-time data copy. After the first copy, I want to copy only new and updated files, and if files are deleted on-prem it ...
0 votes · 1 answer · 358 views
Hadoop distcp: what ports are used?
If I want to use distCp on an on-prem hadoop cluster, so it can 'push' data to external cloud storage, what firewall considerations must be made in order to leverage this tool? What ports does the ...
0 votes · 1 answer · 364 views
DistCP - Even simple copies result in CRC Exceptions
I'm running into an issue using distcp to copy files - every copy fails with an IO Exception (Checksum mismatch), even if performing a simple copy within the cluster (i.e. hadoop distcp -pbugctrx /foo/...
0 votes · 0 answers · 458 views
How should I move files/folders from one Hadoop cluster to another and delete the source contents after moving?
Using Hadoop's distcp command I am able to move files across clusters, but my requirement is that after moving, the contents should be deleted from the source.
hadoop distcp -update -delete -strategy ...
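Note that distcp's -delete removes files from the *target* that are absent from the source; it never deletes the source, and distcp has no "move" mode. A common pattern is copy, verify, then delete the source explicitly — a sketch with placeholder paths:

```shell
# 1. Copy (-delete would prune the target, not the source)
hadoop distcp -update hdfs://src-cluster/data hdfs://dst-cluster/data

# 2. After verifying the copy succeeded, remove the source contents yourself
hdfs dfs -rm -r -skipTrash hdfs://src-cluster/data
```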
0 votes · 1 answer · 809 views
One single distcp command to upload several files to s3 (NO DIRECTORY)
I am currently working with the s3a adapter of Hadoop/HDFS to allow me to upload a number of files from a Hive database to a particular s3 bucket. I'm getting nervous because I can't find anything ...
0 votes · 1 answer · 708 views
How can I get distcp failed files and replay the task?
I ran distcp to copy a file between two HDFS clusters of the same version. When the execution failed, I wanted to find the failed MapReduce task and the related file path, then replay it.
0 votes · 2 answers · 191 views
How to grab all hive files after a certain date for s3 upload (python)
I'm writing a program for a daily upload to s3 of all our hive tables from a particular db.
This database contains records from many years ago, however, and is way too large for a full copy/distcp.
I ...
0 votes · 1 answer · 66 views
HDFS: yarn export is killed when some datanodes run out of space but there are still nodes with plenty of space
We have 2 Hadoop clusters. We want to export an HBase snapshot from one cluster to the other. The target cluster is composed of 3 datanodes of 128 TB and 5 datanodes of 28 TB each. Everything goes ...
6 votes · 0 answers · 300 views
issue when copying hive partitioned table from one cluster to another
I run a distcp command to copy the hdfs location of a table to another cluster.
The copy is scheduled to run every 8 hours.
I run the 'msck repair table' command but not always after the copy.
I have ...
2 votes · 0 answers · 281 views
Error in accessing google cloud storage bucket via hadoop fs -ls that runs on Cloudera Hadoop CDH 6.3.3 integrated with Kerberos/SSL/LDAP cluster
I am getting the below error while accessing a Google Cloud Storage bucket for the first time via Cloudera CDH 6.3.3 Hadoop Cluster. I am running the command on the edge node where Google Cloud SDK is ...
0 votes · 1 answer · 493 views
Is it possible to write directly to final file with distcp?
I'm trying to upload file to s3a using distcp.
distcp writes to a temporary file first and then renames it to the proper filename.
But the user does not allow update/delete. So I have a file with the proper size, ...
1 vote · 0 answers · 198 views
Unable to copy HDFS data to S3 bucket
I have an issue related to a similar question asked before. I'm unable to copy data from HDFS to an S3 bucket in IBM Cloud.
I use command: hadoop distcp hdfs://namenode:9000/user/root/data/ s3a://hdfs-...
0 votes · 0 answers · 648 views
Distcp in Kerberized Cloudera Cluster throws java.io.IOException: Can't get Master Kerberos principal for use as renewer
I am performing a distcp operation to copy files from one directory to another in the same cluster. It is a Kerberized Cloudera Hadoop cluster.
Command I ran:
hadoop distcp -overwrite hdfs://nameservice/...
-1 votes · 2 answers · 2k views
Copying data from one s3 bucket to another s3 bucket of different account in fast manner, just using access_id, secret_access_key cred of both
I have access_key and access_id for both of the AWS buckets, which belong to different accounts. I have to copy data from one location to another; is there a way to do it faster?
I have tried map-reduced-based ...
0 votes · 1 answer · 236 views
Does node with mapr client need to have an access to the files I want to copy with distcp?
Situation:
Node 0: mapr client installed, not a part of a cluster, no external resources mounted
Node 1 to 10 : mapr cluster nodes with mapr NodeManager installed. Every node has mounted external ...
2 votes · 1 answer · 589 views
Hadoop distcp copy from on prem to gcp strange behavior
When I use the distcp command as
hadoop distcp /a/b/c/d gs://gcp-bucket/a/b/c/ , where d is a folder on HDFS containing subfolders.
If folder c is already there on GCP then it copies d (and its ...
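This matches distcp's documented directory semantics: if the destination directory already exists, the source directory is copied *under* it; if it does not exist, it is created and receives the source's contents. One way to get consistent results (bucket and paths taken from the question) is to name the final directory explicitly:

```shell
# Destination exists  -> d lands nested under gs://gcp-bucket/a/b/c/
# Destination missing -> the target is created from d's *contents*
# Naming the target directory explicitly avoids the ambiguity:
hadoop distcp /a/b/c/d gs://gcp-bucket/a/b/c/d
```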
1 vote · 1 answer · 2k views
hdfs distcp failing to copy from hdfs to s3
We have a Snowball configured on our in-house staging node with endpoint http://10.91.16.213:8080. It all works properly; I can even list files in this Snowball via the S3 CLI command
aws s3 ls my-bucket/...
1 vote · 0 answers · 848 views
Running distcp java job using hadoop yarn
I want to copy files present in hdfs to s3 bucket using java code. My java code implementation looks like this:
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;
...
0 votes · 1 answer · 552 views
DistCP Fail to get block MD5
There was one hidden file in the source cluster:
.part-1-1458.inprogress.xxxxxxxxx
Actually, this file was generated by Flink, and its size is 0.
When we use DistCp to copy the directory, we met an ...
-1 votes · 1 answer · 4k views
Client cannot authenticate via: [TOKEN, KERBEROS]
From my Spark application I am trying to distcp from HDFS to S3. My app does some processing on data and writes it to HDFS, and I am trying to push that data to S3 via distcp. I am facing the below error....
0 votes · 0 answers · 704 views
How to copy large number of smaller files from EMR (Hdfs) to S3 bucket?
I have a large CSV file with the below details:
total records: 20 million
total columns: 45
total file size: 8 GB
I am trying to process this csv file using Apache Spark (distributed computing engine) on ...
0 votes · 2 answers · 273 views
How to uncompress file while loading from HDFS to S3?
I have CSV files in LZO format in HDFS. I would like to load these files into S3 and then into Snowflake; as Snowflake does not provide LZO compression for the CSV file format, I am required to convert it ...
0 votes · 1 answer · 2k views
distcp local file to hadoop
I have a 1 GB file on the local file system, /tmp/dist_testfile.
I can copy it with: hadoop fs -put file:///tmp/dist_testfile maprfs:///
But I cannot distcp it. The command hadoop distcp file:///tmp/dist_testfile ...
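A likely explanation for cases like this: distcp runs as a MapReduce job, so a file:// source must be readable on whichever node executes the map task, and a file that exists only on the submitting host will fail. A hedged workaround sketch (the shared-mount path is hypothetical):

```shell
# Option 1: plain put — runs on the local host, no MapReduce involved
hadoop fs -put file:///tmp/dist_testfile maprfs:///

# Option 2: if distcp is required, the file:// path must exist on the
# node(s) running the map tasks, e.g. a mount shared by every node
hadoop distcp file:///shared/mount/dist_testfile maprfs:///
```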
0 votes · 1 answer · 778 views
Using distcp to copy to Azure ADLS Gen1 fails with 403
I am trying to copy to Azure Data Lake Storage (ADLS) Gen1, while authenticating using OAuth2.
I am getting the following error:
com.microsoft.azure.datalake.store.ADLException: Error getting ...
0 votes · 0 answers · 282 views
Not able to access s3 from ec2 instance using "hadoop fs -ls s3a://bucketname" command
I have installed Hadoop in pseudo-distributed mode on one EC2 instance. I have HDFS and YARN running.
I have assigned an s3fullaccess role to my EC2 instance and am able to access S3 using "aws s3 ls" ...
1 vote · 1 answer · 285 views
-Dmapred.job.name does not work with s3-dist-cp command
I'd like to copy some files from EMR HDFS to an S3 bucket using s3-dist-cp. I've tried this command from the EMR master node:
s3-dist-cp -Dmapred.job.name=my_copy_job --src hdfs:///user/hadoop/abc s3://...
0 votes · 1 answer · 439 views
JSON aggregation using s3-dist-cp for Spark application consumption
My spark application running on AWS EMR loads data from JSON array stored in S3. The Dataframe created from it is then processed via Spark engine.
My source JSON data is in the form of multiple S3 ...
1 vote · 1 answer · 909 views
How to change hadoop distcp staging directory
When I ran the command
hadoop distcp -update hdfs://path/to/a/file.txt hdfs://path/to/b/
I got a Java IOException:
java.io.IOException: Mkdirs failed to create /some/.staging/directory
However, I ...
1 vote · 0 answers · 384 views
distcp copy hive table transactional
I have a database in Hive that I want to copy to a new database with encryption.
The tables are transactional.
I used distcp to copy from the first db to the new encrypted one:
hadoop distcp -skipcrccheck -...
1 vote · 0 answers · 701 views
"Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4" when copying data from HDFS to S3
I am trying to use distcp to copy data from HDFS to S3, but I got an error:
Caused by: org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 Error Message. -- ResponseCode: ...
1 vote · 0 answers · 914 views
Hadoop distcp to S3 performance is very slow
I am trying to copy data from HDFS to Amazon S3 using hadoop distcp. The amount of data is 227 GB and the job has been running for more than 12 hours.
Is there a hard limit of 3500 write requests ...
0 votes · 1 answer · 864 views
Using -update option in java distcp
My goal is to use the Java distcp API.
From the command line I am able to perform a distcp:
hadoop --config /path/to/cluster2/hadoop/conf distcp -skipcrccheck -update hdfs://clusterHA1/path/to/...
0 votes · 3 answers · 2k views
Hadoop Distcp - small files issue while copying between different locations
I have tried to copy 400+ GB, and another distcp job with a data size of 35.6 GB, but both of them took nearly 2-3 hours to complete.
We do have enough resources in the cluster.
But when I ...
0 votes · 0 answers · 185 views
Hadoop Distcp: Input size is greater than output size
I am copying a folder from one path to another, basically creating a backup.
The source (input) folder size is 5 TB. I use the following distcp command to copy:
hadoop distcp -m 150 <...
1 vote · 2 answers · 795 views
Moving data from hive views to aws s3
Hi, is there any way we could move data from Hive views to S3? For tables I am using distcp, but since views don't have data residing in an HDFS location I wasn't able to use distcp, and I don't have ...
0 votes · 1 answer · 250 views
Transfer hive query result from one hadoop cluster to another hadoop cluster
I have two clusters, A and B. Cluster A has 5 tables. Now I need to run a Hive query on these 5 tables; the result of the query should update the cluster B table data (covering all the columns of the query result)...
4 votes · 1 answer · 2k views
Hadoop Distcp - increasing distcp.dynamic.max.chunks.tolerable config and tuning distcp
I am trying to move data between two hadoop clusters using distcp. There is a lot of data to move with a large number of small files. In order to make it faster, I tried using -strategy dynamic, which ...
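For reference, the chunk ceiling that the dynamic strategy enforces is an ordinary -D property, so it can be raised on the command line along with the mapper count. A sketch with placeholder paths (the values shown are illustrative, not recommendations):

```shell
# With -strategy dynamic, the chunk count must stay under
# distcp.dynamic.max.chunks.tolerable (default 400); raise it if the
# job aborts complaining about too many chunks.
hadoop distcp -Ddistcp.dynamic.max.chunks.tolerable=2000 \
    -m 200 -strategy dynamic hdfs://src/path hdfs://dst/path
```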
1 vote · 1 answer · 114 views
Data Copy between ADLS Instances
Copying data between various instances of ADLS using distcp.
Hi all,
Hope you are doing well.
We have a use case around using ADLS as different tiers of the ingestion process, just required you ...