
I've got a directory tree created by rsnapshot, which contains multiple snapshots of the same directory structure with all identical files replaced by hardlinks.

I would like to delete all those hardlink duplicates and keep only a single copy of every file (so I can later move all files into a sorted archive without having to touch identical files twice).

Is there a tool that does that?
So far I've only found tools that find duplicates and create hardlinks to replace them…
I guess I could list all files and their inode numbers and implement the deduplicating and deleting myself, but I don't want to reinvent the wheel here.

  • find . ! -type d -links +1 finds files that have more than one hard link. Commented May 31, 2016 at 14:51

6 Answers


In the end it wasn't too hard to do this manually, based on Stéphane's and xenoid's hints and some prior experience with find.
I had to adapt a few commands to work with FreeBSD's non-GNU tools — GNU find has the -printf option that could have replaced the -exec stat, but FreeBSD's find doesn't have that.

# create a list of "<inode number> <tab> <full file path>"
find rsnapshots -type f -links +1 -exec stat -f '%i%t%R' {} + > inodes.txt

# sort the list by inode number (to have consecutive blocks of duplicate files)
sort -n inodes.txt > inodes.sorted.txt

# skip the first file of each block (that's the one link we keep) and list the rest for deletion
awk -F'\t' 'BEGIN {lastinode = 0} {inode = 0+$1; if (inode == lastinode) {print $2}; lastinode = inode}' inodes.sorted.txt > inodes.to-delete.txt

# delete the duplicates (IFS= and -r handle most special characters; filenames containing newlines or tabs would still need extra care)
while IFS= read -r file; do rm -f "$file"; done < inodes.to-delete.txt
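
For reference, on a system with GNU find the first two steps above could probably be collapsed into one command using -printf (not tested here, since this run was on FreeBSD):

# GNU find only: print "<inode><tab><path>" directly, then sort by inode
find rsnapshots -type f -links +1 -printf '%i\t%p\n' | sort -n > inodes.sorted.txt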
  • find rsnapshots -type f -exec stat -f '%i%t%R' {} + | sort -k1,1 -u | cut -f2- will give you a sorted list of ALL filenames under the rsnapshots directory, with duplicate inodes removed. You can feed that into archiving programs (e.g. tar). BTW, many archiving or backup programs (like tar, or rsync with the -H option) already know how to handle hardlinks (i.e. storing only the hardlink, rather than another copy of the file), so this isn't even necessary for them. (See the sketch after these comments.)
    – cas
    Commented Jun 1, 2016 at 2:07
  • NOTE: the FreeBSD version of cut doesn't (yet?) support NUL-separated input, so the find pipeline above is only safe for filenames that don't contain newlines.
    – cas
    Commented Jun 1, 2016 at 2:08
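
As a rough illustration of the hardlink-aware archiving cas mentions above (the destination paths here are purely hypothetical):

# rsync -H preserves hard links in the destination instead of copying the data again
rsync -aH rsnapshots/ /mnt/archive/rsnapshots/
# GNU tar records additional hard links as link entries rather than extra file copies
tar -cf snapshots.tar rsnapshots/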

To find the inodes that have more than one link:

 find . -exec stat -c '%i' {} + | sort -n | uniq -d

Then you can iterate that list with

 find . -inum {inode_number}

to list the files that share the inode. Which to remove is up to you.
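
A quick sketch that combines both steps into one loop (assumes GNU stat as above; on FreeBSD substitute stat -f '%i'):

 find . -exec stat -c '%i' {} + | sort -n | uniq -d |
 while read -r inode; do
     echo "inode $inode:"
     find . -inum "$inode"
 done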


If you know that all the files have hardlinks within a single directory hierarchy only, you can simply do

find inputdir -type f -links +1 -exec rm {} \;

The reason this works is that find runs rm on one file at a time, immediately after stat() has reported a link count greater than 1 for it. Each removal decreases the link count of the remaining names for the same inode by 1, so by the time find reaches the last remaining name, its link count is 1 and rm is not run for it.
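
A quick way to convince yourself of that behaviour in a throwaway directory (the paths here are just an example):

mkdir -p /tmp/linktest/a /tmp/linktest/b
echo data > /tmp/linktest/a/file
ln /tmp/linktest/a/file /tmp/linktest/b/file
find /tmp/linktest -type f -links +1 -exec rm {} \;
find /tmp/linktest -type f    # exactly one of the two names is left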

Note that if any file has hardlinked copies outside the inputdir hierarchy, this command will remove all copies within the inputdir hierarchy!

  • That caveat sounds like exactly what I want. Just point it at one of two slightly different backup dirs I've deduplicated by producing hardlinks, and I'm left with a complete up-to-date backup and a folder containing only the files that differ from it.
    – ssokolow
    Commented Aug 12, 2021 at 23:43

I think you are mistaken in assuming that deleting all the "other" links to a file will save space. The only space you would save is a directory entry, and even that is questionable.

All hard links to a file are equal; there are no "duplicates". Files on Linux are really identified by the filesystem they are on and their inode number within that filesystem.

So when you create a file, you create an inode (where the blocks actually live) and a link to that inode in some directory. That link just points at the inode. If you then make a hard link from that directory entry to another place, you just create a second directory entry somewhere, pointing to the same inode.

If you run ls -i on a file, you will see its inode number. If you want to find other hard links to that same inode, simply run:

find /TOP-OF-FILESYSTEM -type f -inum INODE-NUMBER

where TOP-OF-FILESYSTEM is the mount point of that filesystem and INODE-NUMBER is the inode number of the file in question. Note that -type f is not mandatory; it just speeds up the search, since you are only looking for files.

Note that running ls -il on a file also shows its inode number, alongside the usual long listing.

You can test all of this by going to a scratch directory and creating a file, then creating another link to it:

cd ~/tmp                  # any scratch directory
date > temp1
ln temp1 temp2            # second hard link to the same inode
ls -il temp*              # same inode number, link count of 2 on both names
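
To tie this back to the find command above, you can then look the names up by inode (a small sketch using standard ls and awk; adjust as needed):

inode=$(ls -i temp1 | awk '{print $1}')   # grab temp1's inode number
find ~/tmp -type f -inum "$inode"         # lists both temp1 and temp2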
  • This is all true, but it doesn't answer the question. I think you missed the point: this isn't about disk space, it's about processing the files without processing the same file twice under different names. Commented May 31, 2016 at 21:54
  • Thanks for the explanation, but I'm already familiar with how hardlinks work and with the fact that they don't consume any noteworthy amount of disk space. Like I said, I just wanted to prune my old rsnapshot directory to keep only a single link to each inode, so I won't end up looking at the same file twice when I go through the data and sort it into my new archive.
    – n.st
    Commented May 31, 2016 at 22:04
  • Okay, that was not very clear. :) But most backup programs (even tar) understand hard links and don't waste space. Anyway, I'm glad you solved your problem.
    – Lee-Man
    Commented Jun 1, 2016 at 2:04

rmlint will find and remove duplicates, including hardlinks. At the moment it has no options to remove only hardlinks. Removal is done via an autogenerated shell script, so you can review that script prior to deletion.
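
A rough usage sketch (the exact names of the generated files can differ between rmlint versions, so treat this as an outline):

rmlint rsnapshots/    # scan; writes a removal script (rmlint.sh by default)
less rmlint.sh        # review exactly what it intends to delete
sh rmlint.sh          # run it only after reviewing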

In general, be careful when using duplicate-file detectors in hardlink mode (e.g. fdupes -H), since they can sometimes mistakenly identify a file as its own duplicate (see the "path doubles" discussion here).


To make a copy that has no hard links, you could copy the files to a different filesystem and then back again, or just leave them on the other filesystem.

To the best of my knowledge, hard links do not span filesystems. I am happy to be told I am wrong on that point, as it would fill a hole in my understanding, but I am quite confident I am (generally) correct here.
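
As a concrete (purely illustrative) example of that approach: copying a single snapshot directory to another filesystem materialises each name as an independent file, since hard links cannot cross filesystem boundaries.

# plain cp -R (without --preserve=links) writes every name as a separate file
cp -R rsnapshots/daily.0 /mnt/otherfs/unlinked-snapshot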

  • Either I don’t understand your answer, or you don’t understand the question. The OP wants to “keep only a single copy of every file” — with your answer, they would end up with N copies of every file that was previously linked. Commented May 21, 2022 at 19:42
  • Sorry, I was under the impression that copying just one snapshot folder would solve the issue. I guessed they wanted a snapshot that was unlinked.
    – dij8al
    Commented May 23, 2022 at 4:00
