0 votes
1 answer
56 views

UnsupportedFileSystemException after spring boot starter parent version upgrade to 3.2 or 3.3

We have a Spring Boot application service which calls the Hadoop distcp API. The application works fine with Java 17, spring-boot-starter-parent 3.1.0, and Hadoop 3.3.6. When we changed spring-boot-...
Girish · 26
0 votes
0 answers
36 views

Redirecting distcp output to log4j from stdout

I am currently running distcp using org.apache.hadoop.util.ToolRunner.run and the output is being written to the console. DistCpOptions opts = getDistCpOptions(srcPath, dstPath); private ...
Dmitriy Vasilenko
0 votes
0 answers
55 views

distcp throws java.io.IOException when copying files

I'm trying to copy 20TB of HDFS files from one cluster to another, using this single distcp command: hadoop distcp -Dmapreduce.map.memory.mb=16834 -Dmapreduce.reduce.memory.mb=32768 -Dyarn.app....
mohitblr
0 votes
0 answers
33 views

How to copy data from an old Hive cluster to a new Hive cluster and keep all Hive metastore metadata?

I have two different Apache Hive clusters and I need to transfer data from the old to the new. What is the best way to accomplish this? The old cluster is using hive 1.2 with hadoop 2.7 and the new is ...
Quatervois
0 votes
0 answers
69 views

s3-dist-cp groupby Regex Capture

I'm using EMR to combine hundreds of thousands of very small (1-5 row) CSV files. I want to concatenate them into roughly 100 MB files so they are easier to work with. My EMR job uses command-runner.jar ...
MRMDP · 11
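For a grouping job like the one above, it can help to smoke-test the `--groupBy` regex locally before submitting the EMR step. A minimal sketch, assuming hypothetical bucket and key names (s3-dist-cp uses a Java regex whose capture group decides which files get concatenated together; this simple pattern behaves the same under POSIX ERE, so `grep -E` works as a local check):

```shell
# Hypothetical --groupBy pattern: all keys whose capture group matches the
# same text are concatenated into one output file.
pattern='.*/(part)-[0-9]+\.csv'
echo 's3://my-bucket/data/part-00001.csv' | grep -Eq "$pattern" && echo match
# prints: match

# The real EMR step would then look roughly like (not executed here):
# s3-dist-cp --src s3://my-bucket/data/ --dest s3://my-bucket/combined/ \
#            --groupBy '.*/(part)-[0-9]+\.csv' --targetSize 100
```

`--targetSize` controls the approximate output file size; check the EMR documentation for the exact unit on your release.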
0 votes
0 answers
288 views

Fastest way to copy large data from HDFS location to GCP bucket using command

I have 5 TB of data that needs to be transferred to a GCP bucket using some command. I tried hadoop distcp -m num -strategy dynamic source_path destination_path. It has been executing for a long time. ...
Chetan Mane
0 votes
1 answer
181 views

GCS Connector on EMR failing with java.lang.ClassNotFoundException

I have created an emr cluster with the instructions on how to create a connection from gcs provided here and keep running the hadoop distcp command. It keeps failing with the following error: 2023-07-...
Abhinav Rai
0 votes
0 answers
267 views

How to specify a filter to exclude hdfs file of a partition when calling distcp -update?

Hadoop distcp -update is used with a filter. I want to exclude the hdfs file of the partition dt=20230621. What should I do? The command I am using now is $ hadoop distcp -update -append -filters &...
Hang Mao
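distcp's `-filters` option takes a local file of regular expressions, one per line; any path matching a pattern is skipped. A minimal sketch for excluding one partition (the table paths below are hypothetical):

```shell
# Write a filters file; each line is a Java-style regex applied to the
# full path of every file and directory distcp considers.
cat > /tmp/distcp-filters.txt <<'EOF'
.*/dt=20230621.*
EOF
cat /tmp/distcp-filters.txt

# The copy itself would then be invoked roughly as (not executed here):
# hadoop distcp -update -filters /tmp/distcp-filters.txt \
#     hdfs://src/warehouse/mytable hdfs://dst/warehouse/mytable
```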
0 votes
0 answers
503 views

requested yarn user xxx not found when running distcp from hadoop to GCS

When I try to run the distcp command hadoop distcp /user/a.txt gs://user/a.txt I get the message ERROR tools.DistCp: Exception encountered main : run as user is xxx main : requested yarn user is ...
Liem Nguyen
1 vote
0 answers
419 views

Hadoop distcp does not skip CRC checks

I have an issue with skipping CRC checks between source and target paths running distcp. I copy and decrypt files on demand and their checksum is different, that is expected. My command looks like ...
Nazar Barabash
0 votes
0 answers
227 views

Hadoop distcp to S3

I am using the Hadoop distcp command to move data from HDFS to S3. Recently, after the Hadoop CDH to CDP upgrade, I am seeing a difference in the -update option. Previously, -update would move files with the same file ...
Rajkumar · 189
0 votes
1 answer
129 views

distcp one table to another table with different name

Note: table is name of hive table, which is HDFS directory. I have two servers, C1 and C2. C1 has a table item.name with sequential format. C2 has a table item.name with orc format, which has same ...
natarajan k
0 votes
2 answers
890 views

distcp - copy data from cloudera hdfs to cloud storage

I am trying to replicate data between HDFS and my GCP Cloud Storage. This is not a one-time data copy. After the first copy, I want to copy only new and updated files, and if files are deleted on-prem it ...
Gaurang Shah · 12.8k
0 votes
1 answer
358 views

Hadoop distcp: what ports are used?

If I want to use distCp on an on-prem hadoop cluster, so it can 'push' data to external cloud storage, what firewall considerations must be made in order to leverage this tool? What ports does the ...
007chungking
0 votes
1 answer
364 views

DistCP - Even simple copies result in CRC Exceptions

I'm running into an issue using distcp to copy files - every copy fails with an IO Exception (Checksum mismatch), even if performing a simple copy within the cluster (i.e. hadoop distcp -pbugctrx /foo/...
Adam Luchjenbroers
0 votes
0 answers
458 views

How should I move files/folders from one Hadoop cluster to another and delete the source contents after moving?

Using hadoop's distcp command I am able to move the files across clusters but my requirement is after moving it should delete the contents from the source. hadoop distcp -update -delete -strategy ...
Sachin · 1
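distcp has no built-in "move" mode; the usual pattern is copy, verify, then delete the source. A sketch of that sequence, with hypothetical cluster names and paths (the hadoop commands are printed rather than executed here):

```shell
SRC=hdfs://clusterA/data/incoming
DST=hdfs://clusterB/data/incoming

# 1. Copy; -update makes the step safe to rerun after a partial failure.
echo "hadoop distcp -update -strategy dynamic $SRC $DST"

# 2. Verify the copy before deleting anything, e.g. compare
#    'hadoop fs -count' output for $SRC and $DST.

# 3. Only then remove the source contents.
echo "hadoop fs -rm -r -skipTrash $SRC"
```

Keeping the delete as a separate, verified step (rather than wishing distcp did it) is the conservative choice when the source is the only copy.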
0 votes
1 answer
809 views

One single distcp command to upload several files to s3 (NO DIRECTORY)

I am currently working with the s3a adapter of Hadoop/HDFS to allow me to upload a number of files from a Hive database to a particular s3 bucket. I'm getting nervous because I can't find anything ...
tprebenda · 504
0 votes
1 answer
708 views

How can I get distcp failed files and replay the task?

I ran distcp to copy a file between two HDFS clusters of the same version. The execution failed, and I want to find the failed MapReduce task and the related file path, then replay it.
Allen Wod
0 votes
2 answers
191 views

How to grab all hive files after a certain date for s3 upload (python)

I'm writing a program for a daily upload to s3 of all our hive tables from a particular db. This database contains records from many years ago, however, and is way too large for a full copy/distcp. I ...
tprebenda · 504
0 votes
1 answer
66 views

HDFS: yarn export kills when some datanodes run out of space but there are still nodes with plenty of space

We have 2 Hadoop clusters. We want to export an HBase snapshot from one cluster to the other. The target cluster is composed of 3 datanodes of 128 TB and 5 datanodes of 28 TB each. Everything goes ...
gesala · 1
6 votes
0 answers
300 views

issue when copying hive partitioned table from one cluster to another

I run a distcp command to copy the hdfs location of a table to another cluster. The copy is scheduled to run every 8 hours. I run the 'msck repair table' command but not always after the copy. I have ...
RRy · 453
2 votes
0 answers
281 views

Error in accessing google cloud storage bucket via hadoop fs -ls that runs on Cloudera Hadoop CDH 6.3.3 integrated with Kerberos/SSL/LDAP cluster

I am getting the below error while accessing a Google Cloud Storage bucket for the first time via Cloudera CDH 6.3.3 Hadoop Cluster. I am running the command on the edge node where Google Cloud SDK is ...
bobby · 21
0 votes
1 answer
493 views

Is it possible to write directly to final file with distcp?

I'm trying to upload a file to s3a using distcp. distcp writes to a temporary file first and then renames it to the proper filename. But the user does not allow update/delete. So I have a file with the proper size, ...
psmith · 1,813
1 vote
0 answers
198 views

Unable to copy HDFS data to S3 bucket

I have an issue related to a similar question asked before. I'm unable to copy data from HDFS to an S3 bucket in IBM Cloud. I use command: hadoop distcp hdfs://namenode:9000/user/root/data/ s3a://hdfs-...
dddroog · 55
0 votes
0 answers
648 views

Distcp in Kerberized Cloudera Cluster throws java.io.IOException: Can't get Master Kerberos principal for use as renewer

I am performing distcp operation to copy the files from one directory to another in same cluster. It is a Kerberized Cloudera Hadoop Cluster. Command I ran: hadoop distcp -overwrite hdfs://nameservice/...
Vijay · 123
-1 votes
2 answers
2k views

Copying data from one s3 bucket to another s3 bucket of different account in fast manner, just using access_id, secret_access_key cred of both

I have an access_key and access_id for both AWS buckets, which belong to different accounts. I have to copy data from one location to another; is there a way to do it faster? I have tried a map-reduce-based ...
lifeisshubh
0 votes
1 answer
236 views

Does node with mapr client need to have an access to the files I want to copy with distcp?

Situation: Node 0: mapr client installed, not a part of a cluster, no external resources mounted Node 1 to 10 : mapr cluster nodes with mapr NodeManager installed. Every node has mounted external ...
psmith · 1,813
2 votes
1 answer
589 views

Hadoop distcp copy from on prem to gcp strange behavior

When I use a distcp command such as hadoop distcp /a/b/c/d gs://gcp-bucket/a/b/c/, where d is a folder on HDFS containing subfolders: if folder c is already there on GCP then it copies d (and its ...
Vicky · 1,318
1 vote
1 answer
2k views

hdfs distcp failing to copy from hdfs to s3

We have a snowball configured in our inhouse staging node with end-point http://10.91.16.213:8080. It all works properly, I can even list files in this snowball via s3 cli command aws s3 ls my-bucket/...
Anum Sheraz · 2,605
1 vote
0 answers
848 views

Running distcp java job using hadoop yarn

I want to copy files present in hdfs to s3 bucket using java code. My java code implementation looks like this: import org.apache.hadoop.tools.DistCp; import org.apache.hadoop.tools.DistCpOptions; ...
Divya · 31
0 votes
1 answer
552 views

DistCP Fail to get block MD5

There was one hidden file in the source cluster: .part-1-1458.inprogress.xxxxxxxxx Actually, this file was generated by Flink, and file size is 0. When we use DistCp to copy the directory, we met an ...
Brooklyn Knightley
-1 votes
1 answer
4k views

Client cannot authenticate via: [TOKEN, KERBEROS)

From my Spark application I am trying to distcp from HDFS to S3. My app does some processing on data and writes it to HDFS, and that data I am trying to push to S3 via distcp. I am facing the below error. ...
Mukesh Kumar
0 votes
0 answers
704 views

How to copy large number of smaller files from EMR (Hdfs) to S3 bucket?

I've a large csv file with below details: total records: 20 million total columns: 45 total file size: 8 GB I am trying to process this csv file using Apache Spark (distributed computing engine) on ...
TheCodeCache
0 votes
2 answers
273 views

How to uncompress file while loading from HDFS to S3?

I have CSV files in LZO format in HDFS. I would like to load these files into S3 and then into Snowflake; as Snowflake does not provide LZO compression for the CSV file format, I am required to convert it ...
Vishrant · 16.6k
0 votes
1 answer
2k views

distcp local file to hadoop

I have a 1 GB file on the local file system, /tmp/dist_testfile. I can copy it with: hadoop fs -put file:///tmp/dist_testfile maprfs:/// But I cannot distcp it. The command hadoop distcp file:///tmp/dist_testfile ...
Takito Isumoro
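The usual explanation for this failure pattern: distcp runs as a distributed MapReduce job, so a file:// source must be readable at the same path on every node that could run a map task; a file that exists only on the submitting host will break the job. A sketch of the simpler alternative (paths taken from the question; the command is printed rather than executed here):

```shell
# For a single local file, a plain client-side put avoids the problem
# entirely, since it streams from the submitting host.
echo "hadoop fs -put file:///tmp/dist_testfile maprfs:///tmp/"

# If distcp is genuinely required, reducing to one map (-m 1) shrinks the
# exposure, but task placement still is not guaranteed to land on the node
# that has the file.
```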
0 votes
1 answer
778 views

Using distcp to copy to Azure ADLS Gen1 fails with 403

I am trying to copy to Azure Data Lake Storage (ADLS) Gen1, while authenticating using OAuth2. I am getting the following error: com.microsoft.azure.datalake.store.ADLException: Error getting ...
Hagai Attias
0 votes
0 answers
282 views

Not able to access s3 from ec2 instance using "hadoop fs -ls s3a://bucketname" command

I have installed Hadoop in pseudo-distributed mode on one EC2 instance. I have HDFS and YARN running. I have assigned an s3fullaccess role to my EC2 instance and am able to access S3 using "aws s3 ls" ...
Ram Prakash
1 vote
1 answer
285 views

-Dmapred.job.name does not work with s3-dist-cp command

I'd like to copy some files from emr-hdfs to s3 bucket using s3-dist-cp, I've tried this cmd from "EMR Master Node": s3-dist-cp -Dmapred.job.name=my_copy_job --src hdfs:///user/hadoop/abc s3://...
TheCodeCache
0 votes
1 answer
439 views

JSON aggregation using s3-dist-cp for Spark application consumption

My spark application running on AWS EMR loads data from JSON array stored in S3. The Dataframe created from it is then processed via Spark engine. My source JSON data is in the form of multiple S3 ...
user1988501
1 vote
1 answer
909 views

How to change hadoop distcp staging directory

when I ran command hadoop distcp -update hdfs://path/to/a/file.txt hdfs://path/to/b/ I got an Java IOException: java.io.IOException: Mkdirs failed to create /some/.staging/directory However, I ...
konchy · 849
1 vote
0 answers
384 views

distcp copy hive table transactional

I have a database in Hive that I want to copy to a new database with encryption. The tables are transactional. I used distcp to copy from the first db to the new encrypted one: hadoop distcp -skipcrccheck -...
Jeannot77680
1 vote
0 answers
701 views

"Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4" when copying data from HDFS to S3

I am trying to use distcp to copy data from HDFS to S3, but I got an error: Caused by: org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 Error Message. -- ResponseCode: ...
michelle · 197
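The stack trace above comes from jets3t, i.e. the legacy s3:// connector, which only speaks AWS Signature V2; SSE-KMS requires Signature V4, which the s3a connector supports when pointed at a regional endpoint. A hedged sketch of the usual fix (endpoint, region, and bucket are hypothetical; the command is printed rather than executed here):

```shell
# Generic -D options must precede the other distcp arguments.
echo "hadoop distcp \
  -Dfs.s3a.endpoint=s3.eu-central-1.amazonaws.com \
  -Dfs.s3a.server-side-encryption-algorithm=SSE-KMS \
  hdfs://nn:8020/data s3a://my-bucket/data"
```

Switching the URI scheme from s3:// (or s3n://) to s3a:// is the key step; the jets3t-based connectors were removed in later Hadoop releases precisely because of limitations like this.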
1 vote
0 answers
914 views

Hadoop distcp to S3 performance is very slow

I am trying to copy the data from HDFS to Amazon S3 using hadoop distcp. the amount of data is 227GB and the job has been running for more than 12 hours. Is there a hard limit of 3500 write requests ...
Hemanth · 735
0 votes
1 answer
864 views

Using -update option in java distcp

My goal is to use the java distcp api in java. With command line i am able to perform a distcp : hadoop --config /path/to/cluster2/hadoop/conf distcp -skipcrccheck -update hdfs://clusterHA1/path/to/...
maxime G · 1,772
0 votes
3 answers
2k views

Hadoop Distcp - small files issue while copying between different locations

I tried to copy 400+ GB, and ran one more distcp job with 35.6 GB of data, but both of them took nearly 2-3 hours to complete. We do have enough resources in the cluster. But when I ...
Vijay · 111
0 votes
0 answers
185 views

Hadoop Distcp: Input size is greater than output size

I am copying a folder from one path to another, basically creating a backup. The source(input) folder size is 5 TB. I use the following distcp command to copy: hadoop distcp -m 150 <...
subzero740
1 vote
2 answers
795 views

Moving data from hive views to aws s3

Hi, is there any way we could move data from Hive views to S3? For tables I am using distcp, but since views don't have data residing in an HDFS location I wasn't able to use distcp, and I don't have ...
Rajkumar · 189
0 votes
1 answer
250 views

Transfer hive query result from one hadoop cluster to another hadoop cluster

I have two clusters, A and B. Cluster A has 5 tables. Now I need to run a Hive query on these 5 tables; the result of the query should update the cluster B table data (covering all the columns of the query result)...
Venkatesh N
4 votes
1 answer
2k views

Hadoop Distcp - increasing distcp.dynamic.max.chunks.tolerable config and tuning distcp

I am trying to move data between two hadoop clusters using distcp. There is a lot of data to move with a large number of small files. In order to make it faster, I tried using -strategy dynamic, which ...
Hemanth · 735
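With the dynamic strategy, distcp splits the work into chunks (more chunks than maps, so fast maps can pick up extra work), and the chunk count is validated against distcp.dynamic.max.chunks.tolerable. A sketch of raising that cap alongside the map count (values are illustrative, not tuned recommendations; the command is printed rather than executed here):

```shell
# -D options are generic ToolRunner options and must come before -strategy/-m.
echo "hadoop distcp \
  -Ddistcp.dynamic.max.chunks.tolerable=4000 \
  -strategy dynamic -m 400 \
  hdfs://srcCluster/path hdfs://dstCluster/path"
```

For many small files, raising -m (and hence chunk count) usually helps more than tuning chunk sizes, since per-file open/close overhead dominates the transfer.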
1 vote
1 answer
114 views

Data Copy between ADLS Instances

Copying data between various instances of ADLS using DistCp. Hi all, hope you are doing well. We have a use case around using ADLS as different tiers of the ingestion process; just required you ...
Avinash · 11