124 2023 UnixForPoets - Ec
From Languages to
Information
Tools
• grep: search for a pattern (regular expression)
• sort
• uniq -c (count duplicates)
• tr (translate characters)
• wc (word or line count)
• sed (edit string: replacement)
• cat (send file(s) in stream)
• echo (send text in stream)
• cut (columns in tab-separated files)
• paste (paste columns)
• head
• tail
• rev (reverse lines)
• comm
• join
Prereq: If you are on a Mac:
• Open the Terminal app
Prereq: If you are on a Windows 10 machine and
don't have Ubuntu on your machine:
• For today's class, it's easiest to work with someone who has a Mac
or Linux machine, or has Ubuntu already.
• Or you can do the following so that you will have this ability:
• Watch the first 9 minutes of Bryan's lovely pa0 video about how to download and install Ubuntu:
https://canvas.stanford.edu/courses/144170/modules/items/981067
• Watch Chris Gregg's excellent UNIX videos here: Logging in, the first 7 "File System" videos, and the first 8 "useful
commands" videos.
https://web.stanford.edu/class/archive/cs/cs107/cs107.1186/unixref/
• From there you can use the ssh command to connect to the myth machines. Just be sure to keep track in your own
mind of whether you're on myth or your own laptop at any given moment! The ssh command you want to type is:
• ssh [sunet]@rice.stanford.edu where [sunet] is your SUNet ID. It will ask for your password, which is your usual
Stanford password, and you will have to do two-step authentication.
Prerequisites: get the text file we are using
• rice: ssh into a rice or myth machine and then run the following (don't forget the
final "."):
cp /afs/ir/class/cs124/WWW/nyt_200811.txt .
• Or download to your own Mac or Unix laptop this file:
http://cs124.stanford.edu/nyt_200811.txt
Or:
scp cardinal:/afs/ir/class/cs124/WWW/nyt_200811.txt .
Prerequisites
• The unix “man” command
• e.g., man tr
• man shows you a command's options; it's not
particularly friendly
Prerequisites
• How to chain shell commands and deal
with input/output
• Input/output redirection:
• > “output to a file”
• < “input from a file”
• | “pipe”
• CTRL-C
• The less command (quit by typing "q")
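The redirection operators and the pipe can be tried on a throwaway file before touching the corpus. A minimal sketch, assuming a POSIX shell; the file names sample.txt and upper.txt are made up for illustration.

```shell
# Work in a scratch directory so nothing important is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# ">" sends a command's output to a file (replacing its contents).
printf 'one\ntwo\nthree\n' > sample.txt

# "<" feeds a file to a command as its input; "|" passes the result on.
# (tr -d ' ' strips the padding some versions of wc add.)
line_count=$(wc -l < sample.txt | tr -d ' ')

# "|" pipes the output of one command into the next.
last_upper=$(tr 'a-z' 'A-Z' < sample.txt | tail -n 1)

# Both redirections can be combined: read one file, write another.
tr 'a-z' 'A-Z' < sample.txt > upper.txt

echo "lines: $line_count, last line uppercased: $last_upper"
```

To page through a longer result, end the pipeline with | less and press q to quit.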
OK, you're ready to start!
• PollEv.com/danjurafsky451 for questions
Exercise 1: Count words in a text
• Input: text file (nyt_200811.txt)
• Output: list of words in the file with freq counts
• Algorithm
1. Tokenize (tr)
2. Sort (sort)
3. Count duplicates (uniq -c)
• Go read the man pages and figure out how to pipe these
together
Solution to Exercise 1
• tr -sc 'A-Za-z' '\n' < nyt_200811.txt |
sort | uniq -c
    633 A
      1 AA
      1 AARP
      1 ABBY
     41 ABC
      1 ABCNews
(Do you get a different sort order? In some versions of UNIX, sort doesn't
use ASCII order (uppercase before lowercase).)
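To see what each stage contributes, the same pipeline can be run on a two-line sample instead of the full file. This is only a sketch: the sample text is made up, and LC_ALL=C is an added assumption that forces ASCII ordering so the result is the same on every machine.

```shell
# Hypothetical sample text standing in for nyt_200811.txt.
sample='The cat saw the dog.
The dog saw THE cat!'

# tr -c complements the set (everything that is NOT a letter);
# -s squeezes each run of such characters into a single newline,
# so the output has one word per line.
# sort puts identical words next to each other; uniq -c counts each run.
counts=$(printf '%s\n' "$sample" |
  tr -sc 'A-Za-z' '\n' |
  LC_ALL=C sort |
  uniq -c)

printf '%s\n' "$counts"
```

On this sample THE, The and the are counted as three different words, which is what the downcasing exercise later addresses.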
Some of the output
• tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c | head -n 5
    633 A
      1 AA
      1 AARP
      1 ABBY
     41 ABC
• tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c | head
• head gives you the first 10 lines
• tail does the same with the end of the input
• (You can omit the “-n” but ...
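head and tail are easy to check on numbered input. A minimal sketch, using seq to generate the lines 1 through 20 in place of a long word-count listing:

```shell
# Twenty numbered lines stand in for a long listing.
first_five=$(seq 1 20 | head -n 5)    # first 5 lines
default_head=$(seq 1 20 | head)       # no -n given: first 10 lines
last_three=$(seq 1 20 | tail -n 3)    # last 3 lines

printf '%s\n' "$first_five"
printf '%s\n' "$last_three"
```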
Extended Counting Exercises
1. Merge upper and lower case by downcasing
everything
• Hint: Put in a second tr command
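One possible way to follow the hint, sketched on a made-up sample rather than the real file: a first tr lowercases the text and a second tr tokenizes it. The character ranges assume plain ASCII letters, and LC_ALL=C is added only to make the ordering reproducible.

```shell
# Hypothetical sample text standing in for nyt_200811.txt.
sample='The cat saw the dog.
The dog saw THE cat!'

# First tr: map uppercase letters to lowercase.
# Second tr: put each word on its own line.
counts=$(printf '%s\n' "$sample" |
  tr 'A-Z' 'a-z' |
  tr -sc 'a-z' '\n' |
  LC_ALL=C sort |
  uniq -c)

printf '%s\n' "$counts"
```

After downcasing, all case variants of "the" fall into a single count.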
https://tinyurl.com/ycyubzs8
passwd:
• mango
Sorting and reversing lines of text
• sort
• sort -f     Ignore case
• sort -n     Numeric order
• sort -r     Reverse sort
• sort -nr    Reverse numeric sort
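The effect of these options shows up clearly on a few lines of count-and-word data, like the output of uniq -c. A minimal sketch with made-up counts; LC_ALL=C is an added assumption that pins the plain sort to ASCII order.

```shell
# Hypothetical "count word" lines.
data='9 the
10 cat
2 Dog'

# Plain sort compares characters, so "10" comes before "2" and "9".
plain=$(printf '%s\n' "$data" | LC_ALL=C sort)

# -n compares the leading numbers: 2, 9, 10.
numeric=$(printf '%s\n' "$data" | LC_ALL=C sort -n)

# -nr reverses that: 10, 9, 2 (most frequent first).
reverse_numeric=$(printf '%s\n' "$data" | LC_ALL=C sort -nr)

# -f folds case, so "Dog" sorts between "cat" and "the".
folded=$(printf 'the\ncat\nDog\n' | LC_ALL=C sort -f)

printf '%s\n' "$reverse_numeric"
```

sort -nr is the form to put after uniq -c when the most frequent items should come first.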
Solutions
• Find the 10 most common bigrams
tr 'A-Z' 'a-z' < nyt.bigrams | sort | uniq -c | sort -nr | head -n 10
• Find the 10 most common trigrams
tail -n +3 nyt.words > nyt.thirdwords
paste nyt.words nyt.nextwords nyt.thirdwords > nyt.trigrams
cat nyt.trigrams | tr "[:upper:]" "[:lower:]" | sort | uniq -c | sort -rn | head -n 10
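These solutions rely on nyt.words, nyt.nextwords and nyt.bigrams, which are not built in this section. The following sketch shows how such files could be produced, under two assumptions: nyt.words holds one token per line, and nyt.nextwords is the same list shifted by one line. A made-up sentence replaces the corpus, and the demo.* file names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One word per line (the assumed layout of nyt.words).
printf 'the cat sat the cat ran the cat sat on the cat\n' |
  tr -sc 'A-Za-z' '\n' > demo.words

# tail -n +K starts output at line K, which shifts the list up.
tail -n +2 demo.words > demo.nextwords
tail -n +3 demo.words > demo.thirdwords

# paste glues the files together line by line, separated by tabs.
paste demo.words demo.nextwords > demo.bigrams
paste demo.words demo.nextwords demo.thirdwords > demo.trigrams

# Most frequent bigram and trigram.
top_bigram=$(LC_ALL=C sort demo.bigrams | uniq -c | sort -nr | head -n 1)
top_trigram=$(LC_ALL=C sort demo.trigrams | uniq -c | sort -nr | head -n 1)

printf '%s\n%s\n' "$top_bigram" "$top_trigram"
```

Because the shifted files are shorter, the last lines of the pasted files have empty fields; on a large corpus these few incomplete n-grams do not affect the top of the list.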
grep
• Grep finds patterns specified as regular expressions
• grep rebuilt nyt_200811.txt
Conn and Johnson, has been rebuilt, among the first of the 222
move into their rebuilt home, sleeping under the same roof for the
the part of town that was wiped away and is being rebuilt. That is
to laser trace what was there and rebuilt it with accuracy," she
home - is expected to be rebuilt by spring. Braasch promises that a
grep
• Grep finds patterns specified as regular expressions
• The name stands for “globally search for regular expression and print” (the ed command g/re/p)
grep
• grep is a filter – you keep only some lines of the input
• grep gh        keep lines containing “gh”
• grep '^con'    keep lines beginning with “con”
• grep 'ing$'    keep lines ending with “ing”
• grep -v gh     keep lines NOT containing “gh”
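Each of these filters can be checked against a short word list. A minimal sketch; the six words are made up and stand in for a one-word-per-line file.

```shell
# Hypothetical word list, one word per line.
words='control
laugh
singing
second
night
single'

with_gh=$(printf '%s\n' "$words" | grep gh)         # contains "gh"
starts_con=$(printf '%s\n' "$words" | grep '^con')  # begins with "con"
ends_ing=$(printf '%s\n' "$words" | grep 'ing$')    # ends with "ing"
without_gh=$(printf '%s\n' "$words" | grep -v gh)   # does NOT contain "gh"

printf '%s\n' "$starts_con"
```

Note that "second" contains "con" and "single" contains "ing", but neither passes the anchored patterns: ^ ties the match to the start of the line and $ to the end.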
grep versus egrep (grep –E)
• egrep or grep -E [extended syntax]
• In egrep, +, ?, |, (, and ) are automatically metacharacters
• In grep, you have to backslash them
• To find words ALL IN UPPERCASE:
• egrep '^[A-Z]+$' nyt.words | sort | uniq -c
• Equivalent: grep '^[A-Z]\+$' nyt.words | sort | uniq -c
• wc -l nyt_200811.txt
70334 nyt_200811.txt
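The difference between the two syntaxes can be checked on a handful of tokens. A minimal sketch with made-up input: grep -E is used for the extended form, and because the backslashed \+ of basic grep is a GNU extension, the basic form here uses [A-Z][A-Z]* as a portable way to say "one or more". LC_ALL=C is an added assumption so that [A-Z] matches only ASCII uppercase letters.

```shell
# Hypothetical tokens, one per line.
words='ABC
Times
NASA
a
I'

# Extended syntax: + means "one or more" as written.
extended=$(printf '%s\n' "$words" | LC_ALL=C grep -E '^[A-Z]+$')

# Basic syntax: one uppercase letter followed by zero or more.
basic=$(printf '%s\n' "$words" | LC_ALL=C grep '^[A-Z][A-Z]*$')

printf '%s\n' "$extended"
```

Both commands keep the same lines: the tokens made up entirely of uppercase letters.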
sed exercises
• Count frequency of word initial consonant sequences
• Take tokenized words
• Delete the first vowel through the end of the word
• Sort and count
sed exercises
• Count frequency of word initial consonant sequences
tr "[:upper:]" "[:lower:]" < nyt.words | sed 's/[aeiou].*$//' | sort | uniq -c
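The same pipeline can be traced on a few tokens to see what the sed substitution leaves behind. A minimal sketch: the word list is made up and stands in for nyt.words (assumed to hold one token per line), and LC_ALL=C is added only to fix the sort order.

```shell
# Hypothetical tokens, one per line.
words='The
three
string
Street
apple
the'

# Lowercase, then delete from the first vowel to the end of each line,
# which leaves only the word-initial consonant sequence.
onsets=$(printf '%s\n' "$words" |
  tr '[:upper:]' '[:lower:]' |
  sed 's/[aeiou].*$//' |
  LC_ALL=C sort |
  uniq -c)

printf '%s\n' "$onsets"
```

A word that starts with a vowel, such as "apple", is reduced to an empty line, so the output also includes a count for the empty sequence.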
Extra Credit – Secret Message
• Now, let’s get some more practice with Unix!
• The answers to the extra credit exercises will reveal a secret
message.
• We will be working with the following text file for these
exercises:
https://web.stanford.edu/class/cs124/lec/secret_ec.txt
• To receive credit, enter the secret message here:
https://forms.gle/57okKzZzWeijP4RL7
Extra Credit Exercise 1
• Find the 2 most common words in secret_ec.txt containing the
letter e.
• Your answer will correspond to the first two words of the secret
message.
Extra Credit Exercise 2
• Find the 2 most common bigrams in secret_ec.txt where the
second word in the bigram ends with a consonant.
• Your answer will correspond to the next four words of the secret
message.
Extra Credit Exercise 3
• Find all 5-letter-long words that only appear once in secret_ec.txt.
• Concatenate (by hand) your result. This will be the final word of
the secret message.