124 2023 UnixForPoets - Ec

Download as pdf or txt
Download as pdf or txt
You are on page 1of 39

CS 124/LINGUIST 180

From Languages to
Information

Unix for Poets


Dan Jurafsky
(original by Ken Church, modifications by
me and Chris Manning)
Stanford University
Unix for Poets
• Text is everywhere
• The Web
• Dictionaries, corpora, email, etc.
• Billions and billions of words
• What can we do with it all?
• It is better to do something simple, than nothing at all.
• You can do simple things from a Unix command-line
• Sometimes it’s much faster even than writing a quick python tool
2
• DIY is very satisfying
Exercises we’ll be doing today
1. Count words in a text
2. Sort a list of words in various ways
• ascii order
• ‘‘rhyming’’ order
3. Extract useful info from a dictionary
4. Compute ngram statistics

3
Tools
• grep: search for a pattern • cut (columns in tab-separated
(regular expression) files)
• sort • paste (paste columns)
• uniq –c (count duplicates) • head
• tr (translate characters) • tail
• wc (word – or line – count) • rev (reverse lines)
• sed (edit string -- replacement) • comm
• cat (send file(s) in stream) • join
• echo (send text in stream)
4
Prereq: If you are on a Mac:
• Open the Terminal app

5
Prereq: If you are on a Windows 10 machine and
don't have Ubuntu on your machine:
• For today's class, it's easiest to work with someone who has a Mac
or Linux machine, or has Ubuntu already.
• Or you can do the following so that you will have this ability:
• Watch the first 9 minutes of Bryan's lovely pa0 video about how to download and install Ubuntu:
https://canvas.stanford.edu/courses/144170/modules/items/981067

• Watch Chris Gregg's excellent UNIX videos here: Logging in, the first 7 "File System" videos, and the first 8 "useful
commands" videos.
https://web.stanford.edu/class/archive/cs/cs107/cs107.1186/unixref/
• From there you can use the ssh command to connect to the myth machines. Just be sure to keep track in your own
mind of whether you're on myth or your own laptop at any given moment! The ssh command you want to type is:
• ssh [sunet]@rice.stanford.edu where [sunet] is your SUNet ID. It will ask for your password, which is your usual
6 Stanford password, and you will have to do two-step authentication.
Prerequisites: get the text file we are using
• rice: ssh into a rice or myth and then do (don't forget the
final ".")
cp /afs/ir/class/cs124/WWW/nyt_200811.txt .
• Or download to your own Mac or Unix laptop this file:
http://cs124.stanford.edu/nyt_200811.txt
Or:
scp cardinal:/afs/ir/class/cs124/WWW/nyt_200811.txt .
7
Prerequisites
• The unix “man” command
• e.g., man tr
• Man shows you the command options; it's not
particularly friendly

8
Prerequisites
• How to chain shell commands and deal
with input/output
• Input/output redirection:
•> “output to a file”
•< ”input from a file”
•| “pipe”
• CTRL-C
• The less command (quit by typing "q")
OK, you're ready to start!
• PollEv.com/danjurafsky451 for questions

10
Exercise 1: Count words in a text
• Input: text file (nyt_200811.txt)
• Output: list of words in the file with freq counts
• Algorithm
1. Tokenize (tr)
2. Sort (sort)
3. Count duplicates (uniq –c)
• Go read the man pages and figure out how to pipe these
11 together
Solution to Exercise 1
• tr -sc 'A-Za-z' '\n' < nyt_200811.txt |
sort | uniq -c

633 A
1 AA
1 AARP (Do you get a different sort order?
1 ABBY In some versions of UNIX, sort doesn't
41 ABC use ASCII order (uppercase before
lowercase).)
1 ABCNews
12
Some of the output
• tr -sc 'A-Za-z' '\n' • tr -sc 'A-Za-z' '\n'
< nyt_200811.txt | < nyt_200811.txt |
sort | uniq -c | sort | uniq -c |
head –n 5 head
633 A • head gives you the first 10
1 AA lines
1 AARP • tail does the same with
1 ABBY the end of the input
41 ABC • (You can omit the “-n” but
Extended Counting Exercises
1. Merge upper and lower case by downcasing
everything
• Hint: Put in a second tr command

2. How common are different sequences of vowels (e.g.,


the sequences "ieu" or just "e" in "lieutenant")?
• Hint: Put in a second tr command
14
Solutions
Merge upper and lower case by downcasing everything
tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr 'A-Z' 'a-z'
| sort | uniq -c
or
tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr '[:upper:]'
'[:lower:]' | sort | uniq -c

1. tokenize by replacing the complement of letters with newlines


2. replace all uppercase with lowercase
3. sort alphabetically
4. merge duplicates and show counts
Solutions
• How common are different sequences of vowels (e.g., ieu)
• tr 'A-Z' 'a-z' < nyt_200811.txt | tr -sc
'aeiou' '\n' | sort | uniq -c

16
https://tinyurl.com/ycyubzs8

passwd:
• mango
17
Sorting and reversing lines of text
• sort
• sort –f Ignore case
• sort –n Numeric order
• sort –r Reverse sort
• sort –nr Reverse numeric sort

• echo "Hello" | rev


18
Counting and sorting exercises
• Find the 50 most common words in the NYT
• Hint: Use sort a second time, then head

• Find the words in the NYT that end in "zz"


• Hint: Look at the end of a list of reversed words
• tr 'A-Z' 'a-z' < filename | tr –sc 'a-z' '\n' | rev | sort | rev |
uniq -c
19
Counting and sorting exercises
• Find the 50 most common words in the NYT
tr -sc 'A-Za-z' '\n' < nyt_200811.txt |
sort | uniq -c | sort -nr | head -n 50

• Find the words in the NYT that end in "zz"


tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr
'A-Z' 'a-z' | rev | sort | uniq -c | rev |
20
tail -n 10
Lesson
• Piping commands together can be simple yet powerful
in Unix
• It gives flexibility.

• Traditional Unix philosophy: small tools that can be


composed
21
Bigrams = word pairs and their counts
Algorithm:
1. Tokenize by word
2. Create two almost-duplicate files of words, off
by one line, using tail
3. paste them together so as to
get wordi and wordi +1 on the same line
4. Count 22
Bigrams
• tr -sc 'A-Za-z' '\n' < nyt_200811.txt >
nyt.words
• tail -n +2 nyt.words > nyt.nextwords
• paste nyt.words nyt.nextwords > nyt.bigrams
• head –n 5 nyt.bigrams
KBR said
said Friday
Friday the
the global
23 global economic
Exercises
• Find the 10 most common bigrams
• (For you to look at:) What part-of-speech pattern are most of
them?

• Find the 10 most common trigrams

24
Solutions
• Find the 10 most common bigrams
tr 'A-Z' 'a-z' < nyt.bigrams | sort | uniq
-c | sort -nr | head -n 10
• Find the 10 most common trigrams
tail -n +3 nyt.words > nyt.thirdwords
paste nyt.words nyt.nextwords nyt.thirdwords >
nyt.trigrams
cat nyt.trigrams | tr "[:upper:]" "[:lower:]" | sort |
uniq -c | sort -rn | head -n 10
25
grep
• Grep finds patterns specified as regular expressions
• grep rebuilt nyt_200811.txt
Conn and Johnson, has been rebuilt, among the first of the 222
move into their rebuilt home, sleeping under the same roof for the
the part of town that was wiped away and is being rebuilt. That is
to laser trace what was there and rebuilt it with accuracy," she
home - is expected to be rebuilt by spring. Braasch promises that a
26
grep
• Grep finds patterns specified as regular expressions
• globally search for regular expression and print

• Finding words ending in –ing:


• grep 'ing$' nyt.words |sort | uniq –c

27
grep
• grep is a filter – you keep only some lines of the input
• grep gh keep lines containing ‘‘gh’’
• grep 'ˆcon' keep lines beginning with ‘‘con’’
• grep 'ing$' keep lines ending with ‘‘ing’’
• grep –v gh keep lines NOT containing “gh”
grep versus egrep (grep –E)
• egrep or grep -E [extended syntax]
• In egrep, +, ?, |, (, and ) are automatically metacharacters
• In grep, you have to backslash them
• To find words ALL IN UPPERCASE:
• egrep '^[A-Z]+$' nyt.words |sort|uniq -c
• == grep '^[A-Z]\+$' nyt.words |sort|uniq -c

(confusingly on some systems grep acts like


egrep)
Counting lines, words, characters
• wc nyt_200811.txt
70334 509851 3052306 nyt_200811.txt

• wc -l nyt.words
70334 nyt_200811.txt

Exercise: Why is the number of words different?


30
Exercises on grep & wc
• How many all uppercase words are there in this NYT file?
• How many 4-letter words?
• How many different words are there with no vowels
• What subtypes do they belong to?
• How many “1 syllable” words are there
• That is, ones with exactly one sequence of vowels

Type/token distinction: different words (types) vs. instances (tokens)


31
Solutions on grep & wc
• How many all uppercase words are there in this NYT file?
grep -E '^[A-Z]+$' nyt.words | wc
• How many 4-letter words?
grep -E '^[a-zA-Z]{4}$' nyt.words | wc
• How many different words are there with no vowels
grep -v '[AEIOUaeiou]' nyt.words | sort | uniq | wc
• How many “1 syllable” words are there
• tr 'A-Z' 'a-z' < nyt.words | grep -E
'^[^aeiou]*[aeiou]+[^aeiou]*$' | uniq | wc
Type/token distinction: different words (types) vs. instances (tokens)
sed
• sed is used when you need to make systematic changes to
strings in a file (larger changes than ‘tr’)
• It’s line based: you optionally specify a line (by regex or line
numbers) and specific a regex substitution to make
• For example to change all cases of “George” to “Jane”:

• sed 's/George/Jane/' nyt_200811.txt | less

33
sed exercises
• Count frequency of word initial consonant sequences
• Take tokenized words
• Delete the first vowel through the end of the word
• Sort and count

• Count word final consonant sequences

34
sed exercises
• Count frequency of word initial consonant sequences
tr "[:upper:]" "[:lower:]" < nyt.words | sed
's/[aeiou].*$//' | sort | uniq -c

• Count word final consonant sequences


tr "[:upper:]" "[:lower:]" < nyt.words | sed
's/^.*[aeiou]//' | sort | uniq -c | sort -rn
| less

35
Extra Credit – Secret Message
• Now, let’s get some more practice with Unix!
• The answers to the extra credit exercises will reveal a secret
message.
• We will be working with the following text file for these
exercises:
https://web.stanford.edu/class/cs124/lec/secret_ec.txt
• To receive credit, enter the secret message here:
https://forms.gle/57okKzZzWeijP4RL7
36
Extra Credit Exercise 1
• Find the 2 most common words in secret_ec.txt containing the
letter e.
• Your answer will correspond to the first two words of the secret
message.

37
Extra Credit Exercise 2
• Find the 2 most common bigrams in secret_ec.txt where the
second word in the bigram ends with a consonant.
• Your answer will correspond to the next four words of the secret
message.

38
Extra Credit Exercise 3
• Find all 5-letter-long words that only appear once in secret_ec.txt.
• Concatenate (by hand) your result. This will be the final word of
the secret message.

39

You might also like