
How do you write a bash one-liner that will find binary files with identical contents, permissions, and owner on the same ext4 file-system, from the current working directory recursively, and replace all files with older access times with hard links to the latest accessed file and report saved disk space in kibibytes?

What I have achieved so far does not fully meet the requirements of the objective.

#!/bin/sh
fdupes -r -p -o 'time' . | xargs file -i | grep binary | awk '{print $1}' | awk '{print substr($0,3)}' | sed 's/.\{1\}$//' | xargs rdfind -makehardlinks true
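For reference, the `file -i | grep binary` step in the pipeline above keys on the `charset=binary` part of file(1)'s MIME output. A minimal sketch of that behavior (the temporary paths and contents are illustrative, not from the question):

```shell
#!/bin/bash
# Show how file -i distinguishes text from binary content.
tmp=$(mktemp -d)
printf 'plain text\n' > "$tmp/text.txt"
printf '\x00\x01\x02\x03' > "$tmp/blob.bin"   # NUL bytes force charset=binary
file -i "$tmp"/*                              # MIME type and charset per file
file -i "$tmp"/* | grep "charset=binary"      # only blob.bin matches
rm -rf "$tmp"
```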
  • What is missing from your current solution until all your requirements are completed?
    – thanasisp
    Commented Oct 12, 2020 at 18:41
  • Hard links are not created based on the owner in rdfind. Commented Oct 12, 2020 at 18:58
  • There is the hardlink command for this purpose. Read man hardlink.
    – waltinator
    Commented Oct 13, 2020 at 1:11
  • @waltinator But hardlink doesn't have the features that this problem requires. Commented Oct 13, 2020 at 4:24
  • While fdupes returns groups of files (in paragraph mode, or one group per line with -1), you lose this grouping after the first command in the pipeline, and you need it for any later processing: where will you point a hard link? To the first or last file of each group, according to your time ordering. Also, filenames should be preserved.
    – thanasisp
    Commented Oct 13, 2020 at 4:27

2 Answers


hardlink may not satisfy all the requirements on its own, but it can be used for what it does: making the hard links. It accepts file arguments, not only directories, and it appears to always link a group of identical files to the first one in order. It also ignores zero-size files.

fdupes selects exactly what is needed, but it does not output plain file arguments; it produces paragraph-mode output with groups of identical files, each group terminated by an empty line.

So, to be sure that exactly the selections of fdupes get hard-linked, we have to call hardlink separately, once per paragraph. This avoids the case where two sets of identical contents exist for different owners or with different permissions. And of course the files have to be filtered down to binaries.

#!/bin/bash
unset arr i
while IFS= read -r f; do

    # add file to the array if it is binary
    if file -i "$f" | grep -q "charset=binary"; then
        arr[++i]="$f"
    fi

    # at end of paragraph, if the array has files, hardlink them and reset
    # (note: testing "${arr[@]}" inside [[ ]] is a syntax error once the
    # group holds more than one file, so check the element count instead)
    if [[ -z "$f" && ${#arr[@]} -gt 0 ]]; then
        printf "\n => Hardlink for %d files:\n" "$i"
        hardlink -n -c -vv "${arr[@]}"
        unset arr i
    fi

done < <(fdupes -rpio time .)

hardlink with the -n option performs a dry run and does not write anything, so test the above as-is and remove -n later.
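To also report the saved disk space in kibibytes, a per-group computation can be sketched: since all files in a group are identical, the saving from linking is the size of one file times (count - 1). The file names and sizes below are illustrative, not from the answer:

```shell
#!/bin/bash
# Sketch: KiB saved by hard-linking one group of identical files.
tmp=$(mktemp -d)
printf 'x%.0s' {1..2048} > "$tmp/a"   # a 2048-byte file
cp "$tmp/a" "$tmp/b"
cp "$tmp/a" "$tmp/c"
group=("$tmp/a" "$tmp/b" "$tmp/c")
size=$(stat -c %s "${group[0]}")      # bytes in one copy
count=${#group[@]}
echo "$(( size * (count - 1) / 1024 )) KiB saved"   # prints: 4 KiB saved
rm -rf "$tmp"
```

In the script above, this could run right after the hardlink call, using `"${arr[0]}"` (well, `"${arr[1]}"` given the 1-based indexing) and `$i` in place of the hypothetical `group` and `count`.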

Also, filenames containing newlines are not handled; names with whitespace seem to work fine.

  • But when you feed the files to hardlink, you cannot differentiate files from different users. Commented Oct 13, 2020 at 10:53
  • The use case: fileA and fileB are identical in content and permissions, their owner is USER1, and fileA's access time is the latest. Another pair, fileC and fileD, are identical in content and permissions, their owner is USER2, and fileD's access time is the latest. Then fileB will be replaced with a hard link to fileA, and fileC will be replaced with a hard link to fileD. Commented Oct 13, 2020 at 17:58
  • Thanks a lot, the script above does most of the job. But would it count as a one-liner bash script, and how can we display the size saved in kibibytes? Commented Oct 14, 2020 at 13:07
  • If this answer helps you, you may also check this. The output also has a very good -vv reporting that can be parsed, stored, modified etc.
    – thanasisp
    Commented Oct 14, 2020 at 16:36

Finally, I got the desired result, thanks to @thanasisp. For this you need two programs: fdupes and rdfind.

#!/bin/bash
unset arr i; while IFS= read -r f; do if file -i "$f" | grep -q "charset=binary"; then arr[++i]="$f"; fi; if [[ -z "$f" && ${#arr[@]} -gt 0 ]]; then printf "\n => Hardlink for %d files:\n" "$i"; rdfind -makehardlinks true "${arr[@]}" | grep "Total size is" | grep -P "[0-9]+" -o | head -1 | awk -v count="$i" '{printf("%s kibibytes saved.\n", $1*(count-1)/count/1024)}'; unset arr i; fi; done < <(fdupes -rpio time .)

(As in the accepted script, the end-of-paragraph test uses ${#arr[@]} rather than "${arr[@]}", which would be a syntax error inside [[ ]] for groups of more than one file. The saved space is the reported total size minus one copy, i.e. total*(count-1)/count, converted to kibibytes; dividing the total by count alone would print the size of a single copy instead.)
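After running the one-liner for real, the total space actually reclaimed can be cross-checked by scanning for multiply-linked files. This is a sketch under the assumption that all extra links were created by the dedup run; the temporary files here only demonstrate the mechanics (GNU find's -printf is assumed):

```shell
#!/bin/bash
# Sketch: report KiB reclaimed by existing hard links, counting each inode once.
tmp=$(mktemp -d)
printf 'x%.0s' {1..1024} > "$tmp/a"   # a 1024-byte file
ln "$tmp/a" "$tmp/b"                  # hard link: 1024 bytes reclaimed
find "$tmp" -type f -links +1 -printf '%i %n %s\n' \
  | sort -u \
  | awk '{ saved += $3 * ($2 - 1) } END { printf "%d KiB saved\n", saved/1024 }'
# prints: 1 KiB saved
rm -rf "$tmp"
```

Replace "$tmp" with . to audit the current directory; sort -u collapses the identical inode/nlink/size lines that each link of the same file produces.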
