I have a file with one million lines. Each line has a field called transactionid, which contains repeated values. I need to count the distinct values: no matter how many times a value is repeated, it should be counted only once.
3 Answers
OK, assuming that your file is a text file with fields separated by commas, and that you know the position of the 'transactionid' field. Suppose your 'transactionid' field is the 7th field:
awk -F ',' '{print $7}' text_file | sort | uniq -c
This prints each distinct value in the 7th field along with the number of times it occurs.
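If what you need is just the number of distinct values (each value counted once, no matter how often it repeats), a small variation of the same pipeline should do it, assuming the same comma-separated layout and field position:
awk -F ',' '{print $7}' text_file | sort -u | wc -l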
-
@g10guang Because for uniq to eliminate records they need to be next to each other. – dsz, Commented Jan 13, 2020 at 3:43
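A quick way to see why the sort matters (illustrative values; uniq only collapses adjacent duplicates):
printf 'a\nb\na\n' | uniq | wc -l          # prints 3: the second 'a' is not adjacent to the first
printf 'a\nb\na\n' | sort | uniq | wc -l   # prints 2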
Maybe not the sleekest method, but this should work:
awk '{print $1}' your_file | sort | uniq | wc -l
where $1 is the number corresponding to the field to be parsed.
-
That gives you the number of distinct values plus 1 (the header). How can we keep the counter from including the header? Commented Mar 3, 2021 at 19:37
-
@neverMind Just tell awk to exclude the header from printing: awk 'NR>1 {print $1}' | ... – gented, Commented May 21, 2021 at 10:14
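Putting that together with the pipeline above (assuming a single header line), the full command would be:
awk 'NR>1 {print $1}' your_file | sort | uniq | wc -l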
There is no need to sort the file (it's uniq that requires its input to be sorted; awk does not). This awk script assumes the field is the first whitespace-delimited field.
awk 'a[$1] == "" { a[$1]="X" } END { print length(a) }' file
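Note that length() on an array may not be available in every awk (it is a gawk extension). If portability matters, an explicit counter does the same job (a sketch; seen and n are arbitrary names):
awk '!seen[$1]++ { n++ } END { print n+0 }' file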
-
For a huge file (as in, getting close to the size of RAM), awk will consume a lot of memory. Most sort implementations are designed to cope well with huge files. Commented Jan 12, 2012 at 1:59
... No matter how many times a value is repeated, it should be counted as 1. ...
cat <file_name> | awk -F"|" '{if (substr($2,1,8)=="20120110") print $28}' | sort -u | wc -l
The if clause was an additional check on the date, as is probably obvious :)
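The date filter and the distinct count could also be folded into a single awk pass, reusing the array idea from the previous answer (a sketch, assuming the same pipe-delimited layout with the date in field 2 and the transaction id in field 28; the memory caveat from the comment above applies):
awk -F'|' 'substr($2,1,8)=="20120110" && !seen[$28]++ { n++ } END { print n+0 }' <file_name>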