30

I have a file with one million lines. Each line has a field called transactionid, which has repeated values. What I need to do is count the distinct values: no matter how many times a value is repeated, it should be counted only once.

5
  • It would be easier if you could just give a glimpse of the format of the file, not necessarily the data. Commented Jan 11, 2012 at 14:20
  • Btw, do you want each value to be counted as 1 irrespective of how many times it occurs, or do you want the count of occurrences/repetitions? If each value is only counted once, then how should the distinct values be counted? Please check my edit on your question and confirm whether I have interpreted it correctly. Commented Jan 11, 2012 at 14:27
  • @Nikhil This is clear from the question: ... No matter how many times a value is repeated, it should be counted as 1. ...
    – user13742
    Commented Jan 11, 2012 at 14:28
  • OK, then the answer from @hesse should do what you need. Commented Jan 11, 2012 at 14:30
  • Sorry for the latency, I was without an internet connection. The separator is '|' and the field is field 28. I used: cat <file_name> | awk -F'|' '{if (substr($2,1,8)=="20120110") print $28}' | sort -u | wc -l. The if clause is an extra date check, as is probably obvious :) (a cleaned-up version of this pipeline is sketched below)
    – Olgun Kaya
    Commented Jan 12, 2012 at 6:29
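
For reference, a cleaned-up sketch of that pipeline (my rewrite, not the OP's exact command): it drops the needless cat, starts substr at position 1 (awk strings are 1-indexed), and double-quotes the date literal so it does not terminate the single-quoted awk script. <file_name> is the OP's placeholder.

awk -F'|' 'substr($2,1,8) == "20120110" { print $28 }' <file_name> | sort -u | wc -l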

3 Answers

44

OK, assuming your file is a text file with fields separated by a comma (','), and that you know the position of the transactionid field. Suppose transactionid is the 7th field:

awk -F ',' '{print $7}' text_file | sort | uniq -c

This prints each distinct value in the 7th field together with the number of times it occurs.
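
If you only want the number of distinct values, not the per-value occurrence counts, a minimal variant (same assumed comma separator and 7th-field position) pipes the result through wc -l:

awk -F ',' '{print $7}' text_file | sort -u | wc -l

Here sort -u drops the duplicates during the sort, so a separate uniq step is not needed.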

2
  • Why sort before the uniq command?
    – g10guang
    Commented Dec 6, 2019 at 1:37
  • 1
    @g10guang Because for uniq to eliminate duplicate records, they need to be next to each other; see the sketch below.
    – dsz
    Commented Jan 13, 2020 at 3:43
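
A quick illustration of the point above, using a made-up three-line input (not from the thread): uniq only collapses adjacent duplicates, so without sort the repeated value is counted twice.

$ printf 'a\nb\na\n' | uniq | wc -l
3
$ printf 'a\nb\na\n' | sort | uniq | wc -l
2
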
7

Maybe not the sleekest method, but this should work:

awk '{print $1}' your_file | sort | uniq | wc -l

where the 1 in $1 is the position of the field to be parsed.
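
As a side note (not part of the original answer), if the field position varies you can pass it in with awk's -v option instead of editing the script; f here is an arbitrary variable name:

awk -v f=7 '{print $f}' your_file | sort | uniq | wc -l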

2
  • That gives you the number of distinct values plus 1 (the header). How can we keep the count from including the header?
    – neverMind
    Commented Mar 3, 2021 at 19:37
  • @neverMind Just tell awk to skip the header line when printing: awk 'NR>1 {print $1}' | ... (a full pipeline is sketched below)
    – gented
    Commented May 21, 2021 at 10:14
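
Putting that together, a sketch of the full header-skipping pipeline (your_file as in the answer above; NR>1 drops the first line):

awk 'NR>1 {print $1}' your_file | sort | uniq | wc -l
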
3

There is no need to sort the file; the sort in the other answers is only there because uniq requires its input to be sorted. This awk script assumes the field is the first whitespace-delimited field.

awk 'a[$1] == "" { a[$1]="X" } END { print length(a) }' file 
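Note that calling length() on an array is a GNU awk extension. A portable sketch of the same idea (my variant, not the author's) counts new keys explicitly:

awk '!($1 in a) { a[$1] = 1; n++ } END { print n+0 }' file

The in test does not create the key, so a holds exactly one entry per distinct value and n is the distinct count.
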
1
  • 1
    For a huge file (as in, one approaching the size of RAM), awk will consume a lot of memory, since it keeps every distinct key in its array. Most sort implementations are designed to cope well with huge files (see the sketch below). Commented Jan 12, 2012 at 1:59
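
If memory is the concern, GNU sort also lets you cap its in-memory buffer and spill the rest to temporary files; the 512M figure below is an arbitrary example, and the rest of the pipeline is the one from the first answer above:

awk -F ',' '{print $7}' text_file | sort -u -S 512M | wc -l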
