30

I have a file with one million lines. Each line has a field called transactionid, which has repeated values. What I need to do is count the distinct values: no matter how many times a value is repeated, it should be counted only once.

5
  • It would be easier if you could just give a glimpse of the format of the file, not necessarily the data. Commented Jan 11, 2012 at 14:20
  • Btw, do you want each value to be counted as 1 irrespective of how many times it occurs, or do you want the count of occurrences/repetitions? If each value is only counted once, then how should the distinct values be counted? Please check my edit on your question and confirm whether I have interpreted it correctly. Commented Jan 11, 2012 at 14:27
  • @Nikhil This is clear from the question: ... No matter how many times a value is repeated, it should be counted as 1. ...
    – user13742
    Commented Jan 11, 2012 at 14:28
  • OK, then the answer from @hesse should do what you need. Commented Jan 11, 2012 at 14:30
  • Sorry for the latency, I was without an internet connection. The separator is '|' and the field is field 28. I used: cat <file_name> | awk -F'|' '{if (substr($2,1,8)=="20120110") print $28}' | sort -u | wc -l. The if clause is an extra date check, as is probably obvious :) (a cleaned-up version of this pipeline is sketched below)
    – Olgun Kaya
    Commented Jan 12, 2012 at 6:29
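
For reference, a cleaned-up sketch of that pipeline (my rewrite, not the OP's exact command): it drops the needless cat, starts substr at position 1 (awk strings are 1-indexed), and double-quotes the date literal so it does not terminate the single-quoted awk script. <file_name> is the OP's placeholder.

awk -F'|' 'substr($2,1,8) == "20120110" { print $28 }' <file_name> | sort -u | wc -l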

3 Answers

44

OK, assuming your file is a text file with fields separated by a comma (','), and that you know the position of the transactionid field. Suppose transactionid is the 7th field:

awk -F ',' '{print $7}' text_file | sort | uniq -c

This prints each distinct value in the 7th field together with the number of times it occurs.
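
If you only want the number of distinct values, not the per-value occurrence counts, a minimal variant (same assumed comma separator and 7th-field position) pipes the result through wc -l:

awk -F ',' '{print $7}' text_file | sort -u | wc -l

Here sort -u drops the duplicates during the sort, so a separate uniq step is not needed.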

2
  • Why sort before the uniq command?
    – g10guang
    Commented Dec 6, 2019 at 1:37
  • 1
    @g10guang Because for uniq to eliminate duplicate records, they need to be next to each other; see the sketch below.
    – dsz
    Commented Jan 13, 2020 at 3:43
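
A quick illustration of the point above, using a made-up three-line input (not from the thread): uniq only collapses adjacent duplicates, so without sort the repeated value is counted twice.

$ printf 'a\nb\na\n' | uniq | wc -l
3
$ printf 'a\nb\na\n' | sort | uniq | wc -l
2
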
7

Maybe not the sleekest method, but this should work:

awk '{print $1}' your_file | sort | uniq | wc -l

where the 1 in $1 is the position of the field to be parsed.
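
As a side note (not part of the original answer), if the field position varies you can pass it in with awk's -v option instead of editing the script; f here is an arbitrary variable name:

awk -v f=7 '{print $f}' your_file | sort | uniq | wc -l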

2
  • That gives you the number of distinct values plus 1 (the header). How can we keep the count from including the header?
    – neverMind
    Commented Mar 3, 2021 at 19:37
  • @neverMind Just tell awk to skip the header line when printing: awk 'NR>1 {print $1}' | ... (a full pipeline is sketched below)
    – gented
    Commented May 21, 2021 at 10:14
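
Putting that together, a sketch of the full header-skipping pipeline (your_file as in the answer above; NR>1 drops the first line):

awk 'NR>1 {print $1}' your_file | sort | uniq | wc -l
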
3

There is no need to sort the file; the sort in the other answers is only there because uniq requires its input to be sorted. This awk script assumes the field is the first whitespace-delimited field.

awk 'a[$1] == "" { a[$1]="X" } END { print length(a) }' file 
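Note that calling length() on an array is a GNU awk extension. A portable sketch of the same idea (my variant, not the author's) counts new keys explicitly:

awk '!($1 in a) { a[$1] = 1; n++ } END { print n+0 }' file

The in test does not create the key, so a holds exactly one entry per distinct value and n is the distinct count.
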
1
  • 1
    For a huge file (as in, one approaching the size of RAM), awk will consume a lot of memory, since it keeps every distinct key in its array. Most sort implementations are designed to cope well with huge files (see the sketch below). Commented Jan 12, 2012 at 1:59
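
If memory is the concern, GNU sort also lets you cap its in-memory buffer and spill the rest to temporary files; the 512M figure below is an arbitrary example, and the rest of the pipeline is the one from the first answer above:

awk -F ',' '{print $7}' text_file | sort -u -S 512M | wc -l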
