Questions tagged [deduplication]
For questions where a task is to be applied to only one instance of multiple copies of data (files or blocks of data on a filesystem, or strings in a text), or where all duplicates after the first instance are to be ignored for space- or time-saving purposes.
80 questions
0
votes
1
answer
32
views
Deduplication tool that can compare only two directories against each other?
I tried rdfind and jdupes, and if I specify two directories for them, they both match not only files in one directory against the other directory, but also files inside one of the given directories ...
0
votes
1
answer
38
views
Pop OS with a deduplication filesystem
I'm moving a friend's development machine to Linux (PopOS), permanently. Don't worry guys, he dual-booted and he's ready for the tux.
The problem is his drive. It's a 256GB SSD, and he is moving from a ...
-1
votes
2
answers
124
views
Batch rename of files that share the same prefix
I have a list of files on my server with a prefix that I want to de-dupe. These are completely different generated files.
They seem to be generated files of the form
{Title} - {yyyy-MM-dd}_{random} - {...
0
votes
0
answers
48
views
20+ backup directories, I'd like to dedupe all files to 1 "master directory"
As the title suggests, I have inherited a file structure where there are about 30 "complete or partial backups" of a fileserver full of text files. This obviously makes no sense, and I'd ...
5
votes
6
answers
890
views
Keep unique values (comma separated) from each column
I have a .tsv (tab-separated columns) file on a Linux system with the following columns that contain different types of values (strings, numbers) separated by a comma:
col1 col2
. NS,NS,...
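An approach to the question above can be sketched with awk; this is a minimal sketch assuming tab-separated columns whose values are comma-separated, as in the sample:

```shell
# Hedged sketch: for each tab-separated column, keep only the first
# occurrence of each comma-separated value. Input layout is assumed.
dedupe_cols() {
  awk -F'\t' -v OFS='\t' '{
    for (i = 1; i <= NF; i++) {
      n = split($i, vals, ",")
      for (k in seen) delete seen[k]      # reset the lookup per field
      out = ""
      for (j = 1; j <= n; j++)
        if (!seen[vals[j]]++)
          out = (out == "" ? vals[j] : out "," vals[j])
      $i = out
    }
    print
  }'
}
printf 'NS,NS,AB\tx,y,x\n' | dedupe_cols
```

The per-field reset loop keeps the sketch POSIX-portable; `delete seen` without a subscript is an awk extension.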
0
votes
2
answers
498
views
Standalone Fileserver with deduplication wanted
Situation:
I want to reinstall a homelab server (Windows OS) as a Linux-based server
Server | Purpose: Backup System (mostly offline)
I currently have an HP Proliant Microserver N54
Turion II Neo N54l ...
1
vote
1
answer
59
views
See if any of a number of zip files contains any of the original files in a directory structure
I have a pretty hard problem here.
I have a photo library with a lot of photos in it in various folders.
I then started using Google Photos for my photos: I put those originals into Google Photos, and ...
-1
votes
2
answers
165
views
Join files together without using space in filesystem [duplicate]
I want to join (concatenate) two files in Linux without using space in the filesystem. Can I do this?
A + B = AB
The file AB use sectors or fragments of A and B from the filesystem. Is it possible to ...
0
votes
2
answers
78
views
I want to list only the text using grep or any other option in a shell script
I have a folder called rules/resources; in it I have sub-folders, let's say A, B and C. Each sub-folder contains constraint.yaml.
Now I want to grep the constraint.yaml files which contain the ...
0
votes
1
answer
253
views
Removing duplicates from multiple directories (more than 2 paths) using rmlint or another tool
I am trying to remove duplicated files and folders from several directories, and I was wondering if rmlint supports inputting multiple directories (I know you can use two directories if one of them is ...
0
votes
2
answers
945
views
Linux command for moving, merging and renaming duplicates
I am trying to move directories (with sub-directories and files) to another directory. With mv some folders are not merging because the same directory exists with files. This is no good because even ...
2
votes
2
answers
3k
views
How to get deduplication for an Ext4 partition used by Debian, Ubuntu and Linux Mint?
Ext4 doesn't support deduplication, unlike e.g. btrfs, bcachefs and ZFS, which support deduplication as standard.
How can I get deduplication support for Ext4?
3
votes
1
answer
1k
views
Choosing the right block size for duperemove
I am trying to deduplicate a BTRFS filesystem with multiple subvolumes. Altogether, it holds around 3.5 TB of data, which I expect to be slightly more than half that size after deduping. I am mostly ...
4
votes
1
answer
3k
views
Is there a way to consolidate (deduplicate) btrfs?
I have a btrfs volume, which I create regular snapshots of. The snapshots are rotated, the oldest being one year old. As a consequence, deleting large files may not actually free up the space for a ...
0
votes
4
answers
158
views
Filtering duplicates with AWK differing by timestamp
Given the list of files ordered by timestamp as shown below, I am seeking to retrieve the last occurrence of each file (the one at the bottom of each group).
For example:
archive-daily/document-sell-report-...
-2
votes
1
answer
575
views
low-memory server with ZFS deduplication: can zram help?
Is it a good idea to use compressed memory if people want to save on expensive RAM when using ZFS with deduplication?
-1
votes
1
answer
390
views
can swap help a low-memory ZFS server?
If we cannot buy much RAM, can we replace the "missing" RAM with a larger amount of swap? E.g.: using a dedicated 512 GB SSD for swap instead of 512 GB of RAM for ZFS with deduplication on an ...
0
votes
2
answers
52
views
Why do these two commands to write text-processing results back to the input file behave so differently?
I have a file authorized_keys and want to deduplicate the content, i.e. remove duplicate entries.
I found two possible solutions to achieve this:
Use cat and uniq, then redirect the output to the ...
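The classic pitfall behind questions like this one: redirecting output back into the input file truncates it before it is read. A minimal sketch of the safe patterns (the file here is just a placeholder for authorized_keys):

```shell
f=$(mktemp)                      # stand-in for authorized_keys
printf 'key-A\nkey-B\nkey-A\n' > "$f"

# DON'T: cat "$f" | sort -u > "$f"
# The shell opens (and truncates) "$f" for writing before cat reads it.

# Safe, order-preserving in-place dedupe via a temp file:
awk '!seen[$0]++' "$f" > "$f.tmp" && mv "$f.tmp" "$f"

# sort(1) can also rewrite its own input safely: sort -u "$f" -o "$f"
cat "$f"
```

The awk idiom preserves the original line order, which matters for files like authorized_keys where sorting is undesirable.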
0
votes
1
answer
77
views
Why did sort -u remove duplicates only in a pipe?
In this command, sort -u removed duplicates.
curl https://en.wikipedia.org/wiki/Help:Special_page -s | grep -oP 'Special:\K[a-zA-Z0-9]*' | sort -u > special_page_names
In this command, it didn't.
...
1
vote
1
answer
925
views
How to let already copied files share fragments (reflink)?
I copy a file to a different XFS volume on daily basis as follows:
# on monday
cp --sparse=always /mnt/disk1/huge.file /mnt/disk2/monday/huge.file
# on tuesday
cp --sparse=always /mnt/disk1/huge.file /...
0
votes
1
answer
1k
views
lvmvdo doesn't deduplicate my data
I am installing lvmvdo on Debian 11.2 to store Proxmox VM disks, doing the following:
1) apt install -y build-essential libdevmapper-dev libz-dev uuid-dev git sudo libblkid-dev man vim dwarves dkms ...
1
vote
1
answer
273
views
Meaning of Deduplicated during Borg Create's Realtime Output
When the borg create command is used with the --progress argument, it outputs something like this:
5.50 GB O 5.10 GB C 23.95 kB D 15600 N /path/to/current/file/being/processed
I was able to locate what the ...
1
vote
2
answers
3k
views
How to install vdo/kvdo on Ubuntu 20.04?
I would like to know if there is a way to install Red Hat's vdo in Ubuntu 20.04.
So far, I have tried to download the source and compile it, but I get the following error:
cc -fPIC -fpic -D_GNU_SOURCE ...
6
votes
3
answers
4k
views
Finding duplicate files with same filename AND exact same size
I have a huge songs folder with a messy structure and files duplicated in multiple folders.
I need a recommendation for a tool or a script that can find and remove duplicates with two simple matches:
...
4
votes
1
answer
2k
views
Is there a fuzzy duplicate finder for videos, that does not require a GUI?
I am currently trying to eliminate duplicate videos that differ only minimally. The differences might be a slightly different encoding, a lower resolution or just changed metadata. These videos are in a complex ...
0
votes
6
answers
556
views
Remove adjacent duplicated words from string
I have a string like this:
one one tow tow three three tow one three
How can I remove the duplicated words to make it like this:
one tow three tow one three
The point is that I want to write a ...
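A sketch of one way to collapse only adjacent repeats (keeping the sample's spelling as-is):

```shell
# Collapse runs of the same word; non-adjacent repeats are kept.
squeeze_words() {
  awk '{
    out = ""; prev = ""
    for (i = 1; i <= NF; i++) {
      if ($i != prev)
        out = (out == "" ? $i : out " " $i)
      prev = $i
    }
    print out
  }'
}
echo 'one one tow tow three three tow one three' | squeeze_words
# -> one tow three tow one three
```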
0
votes
4
answers
1k
views
How to remove duplicate values on the same row using awk?
I want to remove duplicated columns/fields on the same row only. I tried, but I ended up with long code with nested loops, conditions and arrays that doesn't work correctly.
input data:
1 2 3 4
1 2 3 ...
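A hedged sketch for per-row field dedupe, assuming space-separated fields as in the sample:

```shell
# Keep only the first occurrence of each field within a single row.
uniq_fields() {
  awk '{
    for (k in seen) delete seen[k]   # reset the lookup per line
    out = ""
    for (i = 1; i <= NF; i++)
      if (!seen[$i]++)
        out = (out == "" ? $i : out " " $i)
    print out
  }'
}
printf '1 2 3 4\n1 2 3 1\n' | uniq_fields
# -> 1 2 3 4
#    1 2 3
```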
0
votes
2
answers
353
views
Script to find duplicate files by extension and delete them
Recently my NAS was hit by ransomware and all my files were 7zipped. I managed to get the password and extract them and at the same time I renamed the 7zipped file to 7z.bad (so that it's easier ...
3
votes
1
answer
4k
views
Is there any advantage to using a hard-link on ZFS instead of relying upon deduplication when considering only disk space allocation?
If I want to create multiple instances of a file on a ZFS file system, is there any advantage to using a hard-link instead of relying upon deduplication as a method of preserving disk space?
This ...
0
votes
0
answers
113
views
I want to write a shell script to delete duplicates in a directory
I do
find ./ -type f -exec md5 '{}' \; > tempfile
tempfile has lines like this:
MD5 (.//Photos-10/IMG_20200901_183050612.jpg) = 2e1f245d195b8d2c3z926dbe0410f7b5
I want to check for repeated ...
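A sketch of the usual checksum-based approach, using GNU md5sum rather than macOS md5 (the rm step is illustrative only; review the list before deleting, and note that filenames containing newlines would break this):

```shell
# List every file whose checksum was already seen, i.e. all but one
# copy of each duplicate. Assumes GNU md5sum ("hash  path" output).
list_dupes() {
  find "$1" -type f -exec md5sum {} + |
    sort |                                  # group identical hashes
    awk 'seen[$1]++ { sub(/^[^ ]+  /, ""); print }'
}
# Review the output first, then e.g.: list_dupes . | xargs -d '\n' rm --
```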
2
votes
2
answers
2k
views
Deduplicating Files while moving them to XFS
I've got a folder on a non reflink-capable file system (ext4) which I know contains many files with identical blocks in them.
I'd like to move/copy that directory to an XFS file system whilst ...
1
vote
2
answers
3k
views
How to use `rmlint` to remove duplicates only from one location and leave all else untouched?
I have two locations /path/to/a and /path/to/b. I need to find duplicate files in both paths and remove only the items in /path/to/b. rmlint generates quite a large removal script, but it contains ...
2
votes
1
answer
341
views
FIDEDUPERANGE ioctl doesn't behave as expected on btrfs
According to ioctl_fideduperange,
The maximum size of src_length is filesystem dependent
and is typically 16 MiB.
However, I've been able to use a src_length of > 1 GiB successfully with a single call ...
1
vote
1
answer
712
views
How to use rmlint to merge two large folders?
In exploring options to merge two folders, I've come across a very powerful tool known as rmlint. It has some useful documentation (and Gentle Guide).
I have a scenario that I previously mentioned and ...
1
vote
2
answers
950
views
Finding duplicate files using bash script
How do you write a bash one-liner that will find binary files with identical contents, permissions, and owner on the same ext4 file-system, from the current working directory recursively, and replace ...
0
votes
1
answer
313
views
Find duplicate paragraphs in two files and delete one
I have two bib files; some of the entries are duplicates. The duplicate entries are in paragraphs, or could be identified by the same pattern, e.g.
a.bib looks like
@InProceedings{Arranged,
author = {...
1
vote
1
answer
31
views
Separately storing parts of text files and their reconstruction: symlinks with multiple targets?
I have two text files whose headers are different, while their contents are the same.
$ cat original_file_v1
header 1 beginning
header 1 contents
header 1 end
common contents line 1
common contents ...
1
vote
1
answer
171
views
Hard link duplicate files based on just size
I'm currently running rdfind on a directory containing more than 4TB of files. Since the checksum part takes an inordinate amount of time I'm looking for alternatives. I'm fairly certain that there ...
2
votes
0
answers
127
views
dedup seems to be working, but free space says otherwise
I'm testing out dedup with zol on Debian.
I've set dedup=on on the zvol before copying any data, then copied data to the mounted volume with no problems.
Then I've run cp -r folder1/* folder2/ on test ...
2
votes
0
answers
547
views
Can I share a flatpak directory across hosts?
The context is more or less as follows:
I run Distro Foo as my main driver. Inside that I have also set up a few chroots or schroots with Distro Bar, Distro Baz, and an older (or newer) version of ...
0
votes
2
answers
125
views
Remove duplicated first fields of a CSV file
I am trying to remove repetitions of the same value in the first column of a CSV file without changing the other cell contents and alignment (in other columns).
My txt:
ACCIDENT EP 4 STEM PERCUS,, ...
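Assuming repeated first-column values appear on consecutive rows, a minimal awk sketch that blanks the repeat and leaves every other column untouched (for non-consecutive repeats, a seen-array would be needed instead):

```shell
# Blank the first field when it repeats the previous row's value.
dedupe_first_col() {
  awk -F, -v OFS=, '{ cur = $1; if (cur == prev) $1 = ""; prev = cur; print }'
}
printf 'A,1\nA,2\nB,3\n' | dedupe_first_col
# -> A,1
#    ,2
#    B,3
```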
3
votes
2
answers
470
views
Remove duplicate lines from files recursively in place but leave one - make lines unique across files
I have many folders and the folders contain files. The same line might appear multiple times in a single file and/or in multiple files. The files are not sorted. So there are some lines duplicated across ...
1
vote
1
answer
1k
views
Append text file with command output, but replace the words that already exist (don't add the same text twice)
I am appending a command's output to a text file. But if I do it again, the same text will appear twice in the text file. Is there a way, with for example sed, so that if a word already exists it doesn't add a new ...
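A common idiom for this uses grep rather than sed: -x matches whole lines, -F disables regex interpretation, -q suppresses output. A sketch (the function name is made up):

```shell
# Append a line to a file only if the file doesn't already contain it.
append_once() {  # usage: append_once <line> <file>
  grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

f=$(mktemp)
append_once "hello" "$f"
append_once "hello" "$f"   # skipped: already present
append_once "world" "$f"
cat "$f"
# -> hello
#    world
```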
0
votes
1
answer
940
views
How to run Shredder Duplicate Finder (rmlint --gui) on Debian? ("Failed to load shredder: No module named 'shredder'")
I'd like to run the rmlint GUI (Shredder) on Debian 10 but I get this error:
Failed to load shredder: No module named 'shredder'
This might be due to a corrupted install; try reinstalling.
2
votes
0
answers
274
views
Compare disks for missing files with differing directory structure
Apologies for the length of this post; I've tried to keep it as short as possible.
I'm looking for a tool / method which, given two paths, would show which files are not present in one of the paths. The ...
5
votes
0
answers
751
views
Backing up a deduplicated BTRFS filesystem
I have some long-term data in a BTRFS volume. I've been using btrfs-dedupe to deduplicate my data and I'm able to save a tremendous amount of disk space between filesystem compression and ...
7
votes
2
answers
3k
views
Is there a block-level storage file system?
I'm looking for a file system that stores files by block content, so that similar files would only take one block. This is for backup purposes. That is, similar to what block-level backup storage ...
10
votes
1
answer
7k
views
Is there a way to enable reflink on an existing XFS filesystem?
I currently have a 4TB RAID 1 setup on a small, personal Linux server, which is formatted as XFS in LVM. I am interested in enabling the reflink feature of XFS, but I did not do so when I first ...
4
votes
1
answer
1k
views
How to copy multiple snapshots at once without duplicating data?
I have a live btrfs filesystem of 3.7TiB that's >90% full including old snapshots and a fresh 4TB backup harddisk. How to copy all existing snapshots to the backup harddisk?
I tried
# btrfs send ...
2
votes
1
answer
68
views
Checking identical files in Linux and deleting according to location
I use fdupes to find and delete identical files.
But I want to be able to say something like this ...
find all the files that are duplicate in directory A or its subdirectories
if there's a ...
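One way to get location-aware deletion out of fdupes is to post-process its blank-line-separated duplicate groups. A sketch under assumptions: directory names A and B are placeholders, and xargs -d is a GNU extension:

```shell
# From each duplicate group, print the copies under B/ only when the
# group also has a copy under A/ (so at least one copy survives).
pick_b_dupes() {
  awk -v RS='' -v FS='\n' '{
    inA = 0
    for (i = 1; i <= NF; i++) if ($i ~ /^A\//) inA = 1
    if (inA)
      for (i = 1; i <= NF; i++) if ($i ~ /^B\//) print $i
  }'
}
# fdupes -r A B | pick_b_dupes                     # review first
# fdupes -r A B | pick_b_dupes | xargs -d '\n' rm --
```

Setting RS='' puts awk in paragraph mode, so each fdupes group becomes one record with its file paths as newline-separated fields.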