Unix For Poets
124/LINGUIST 180
From Languages to Information
Christopher Manning
Exercises to be addressed
1. Count words in a text
2. Sort a list of words in various ways
   1. ascii order
   2. rhyming order
Tools
grep: search for a pattern (regular expression)
sort
uniq -c (count duplicates)
tr (translate characters)
wc (word or line count)
sed (edit string -- replacement)
cat (send file(s) in stream)
echo (send text in stream)
Prerequisites
CTRL-C
Solution to Exercise 1
tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c
25476 a
1271 A
3 AA
3 AAA
1 Aalborg
1 Aaliyah
1 Aalto
2 aardvark
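A minimal sketch of the same pipeline on a tiny inline sample, so it can be tried without nyt_200811.txt. The function name count_words and the sample sentence are illustrative assumptions, not part of the original slides; LC_ALL=C is added only to make the sort order predictable.

```shell
# Hypothetical helper: count word tokens read from standard input.
count_words() {
    # tr -sc: turn every run of non-letters into a single newline,
    # which leaves one word per line.
    # sort groups identical words together; uniq -c counts each group.
    tr -sc 'A-Za-z' '\n' | LC_ALL=C sort | uniq -c
}

# Inline sample text (an assumption, standing in for the NYT file).
printf 'the cat saw the hat. The end\n' | count_words
```

In the C locale uppercase letters sort before lowercase ones, so the ordering can differ from the slide output above, which appears to come from a locale-aware sort.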
sort
sort -f    Ignore case
sort -n    Numeric order
sort -r    Reverse sort
sort -nr   Reverse numeric sort
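As a hedged illustration of why the numeric and reverse flags matter, the sketch below feeds the output of uniq -c into sort -nr so the most frequent words come first. The function name top_words and the sample sentence are assumptions made for this example.

```shell
# Hypothetical helper: word counts, most frequent first.
top_words() {
    # First sort groups identical words for uniq -c;
    # second sort orders the "count word" lines numerically, descending.
    tr -sc 'A-Za-z' '\n' | LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr
}

# Inline sample text (an assumption, not from the slides).
printf 'to be or not to be that is to\n' | top_words | head -n 3
```

Without -n the counts would be compared as strings, so 10 would sort before 9; -n compares them as numbers, and -r puts the largest first.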
Lesson
Piping commands together can be simple yet powerful in Unix
It gives flexibility.
Bigrams
Exercises
Find the 10 most common bigrams
(For you to look at:) What part-of-speech pattern are most of them?
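One possible approach to the bigram exercise, sketched under assumptions: write the words one per line, pair each word with its successor using tail and paste, then count the pairs. The function name bigram_counts, the temporary file, and the sample text are all illustrative and not taken from the slides.

```shell
# Hypothetical helper: count adjacent word pairs read from standard input.
bigram_counts() {
    words_file=$(mktemp)
    # One word per line.
    tr -sc 'A-Za-z' '\n' > "$words_file"
    # tail -n +2 gives the same list shifted up by one line;
    # paste joins word i with word i+1, separated by a tab.
    # sed '$d' drops the final line, whose second column is empty.
    tail -n +2 "$words_file" | paste "$words_file" - | sed '$d' \
        | LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr
    rm -f "$words_file"
}

# Inline sample text (an assumption); head -n 10 keeps the ten most common.
printf 'the cat and the hat and the cat\n' | bigram_counts | head -n 10
```

On a real corpus, the same function could be fed with a redirect such as bigram_counts < nyt_200811.txt.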
grep
Grep finds patterns specified as regular expressions
grep rebuilt nyt_200811.txt
Conn and Johnson, has been rebuilt, among the first of the 222
move into their rebuilt home, sleeping under the same roof for the
the part of town that was wiped away and is being rebuilt. That is
to laser trace what was there and rebuilt it with accuracy," she
home - is expected to be rebuilt by spring. Braasch promises that a
the anonymous places where the country will have to be rebuilt,
"The party will not be rebuilt without moderates being a part of
The name grep comes from g/re/p: globally search for regular expression and print
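A small sketch of basic grep usage on inline text, assuming a three-line sample in place of nyt_200811.txt. The variable name sample and its contents are made up for illustration.

```shell
# Hypothetical three-line sample standing in for the NYT file.
sample=$(printf 'The house was rebuilt.\nNothing to see here.\nThey rebuilt it again.\n')

# Print every line that contains the pattern.
printf '%s\n' "$sample" | grep 'rebuilt'

# -c prints the number of matching lines instead of the lines themselves.
printf '%s\n' "$sample" | grep -c 'rebuilt'
```

grep works line by line: a line is printed once if the pattern matches anywhere in it, however many times it matches.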
grep
grep -P: Perl regular expressions (extended syntax)
grep -P '^[A-Z]+$' nyt.words | sort | uniq -c    (ALL UPPERCASE)
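A hedged sketch of the all-uppercase search on inline text. It swaps in grep -E for grep -P, since this particular pattern needs no Perl-specific syntax and -E does not depend on PCRE support being compiled in. The function name upper_words and the sample sentence are assumptions.

```shell
# Hypothetical helper: count words written entirely in capitals.
upper_words() {
    # ^ and $ anchor the pattern so the whole line (one word) must be A-Z.
    # LC_ALL=C keeps the A-Z range limited to ASCII capitals.
    tr -sc 'A-Za-z' '\n' | LC_ALL=C grep -E '^[A-Z]+$' | LC_ALL=C sort | uniq -c
}

# Inline sample text (an assumption, not from the slides).
printf 'NASA and the FBI met; NASA left. A plan\n' | upper_words
```

Note that single capital letters such as "A" also match, because the pattern only requires one or more uppercase letters.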
sed
sed is a simple string (i.e., lines of a file) editor
You can match lines of a file by regex or line numbers and make changes
Not much used in 2013, but the general regex replace function still comes in handy
sed 's/George Bush/Dubya/' nyt_200811.txt | less
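A short sketch of the substitute command on an inline sentence, which is an assumption made so the example runs without the NYT file. It also shows the g flag, which the slide's command does not use.

```shell
# Without a flag, s/// replaces only the first match on each line.
printf 'George Bush met George Bush.\n' | sed 's/George Bush/Dubya/'

# With the g flag, every match on the line is replaced.
printf 'George Bush met George Bush.\n' | sed 's/George Bush/Dubya/g'
```

The first command prints "Dubya met George Bush." and the second prints "Dubya met Dubya."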
sed exercises
Count frequency of word-initial consonant sequences
  Take tokenized words
  Delete the first vowel through the end of the word
  Sort and count
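The three steps above can be sketched as one pipeline. This is only one possible approach: the function name onset_counts, the sample text, and the lowercasing step (added so that "The" and "the" count together) are assumptions not stated on the slide.

```shell
# Hypothetical helper: count word-initial consonant sequences.
onset_counts() {
    # 1. Tokenize: one word per line, lowercased (lowercasing is an assumption).
    # 2. Delete from the first vowel through the end of the word.
    # 3. Sort and count, most frequent first.
    # Words that begin with a vowel become empty lines and are counted
    # together as an empty onset.
    tr -sc 'A-Za-z' '\n' | tr 'A-Z' 'a-z' \
        | sed 's/[aeiou].*$//' \
        | LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr
}

# Inline sample text (an assumption, not from the slides).
printf 'the three trees stand there\n' | onset_counts
```

Here "y" is treated as a consonant; adding it to the bracket expression would change that choice.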
awk
Ken Church's slides then describe awk, a simple programming language for short programs on data usually in fields
I honestly don't think it's worth learning awk in 2013
Better to write little programs in your favorite scripting language, be that Python, or Perl, or Groovy, or ...
shuf
Randomly permutes (shuffles) the lines of a file
Exercises
Print 10 random word tokens from the NYT excerpt
  10 instances of words that appear, each word instance equally likely
Print 10 random word types from the NYT excerpt
  10 different words that appear, each different word equally likely
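A hedged sketch of the difference between the two exercises: sampling tokens draws from the full word list, repeats included, while sampling types draws from the de-duplicated list. The function names random_tokens and random_types and the sample text are assumptions; shuf -n is the GNU coreutils option for limiting the number of output lines.

```shell
# Hypothetical helper: N random word tokens (frequent words are more likely).
random_tokens() {
    tr -sc 'A-Za-z' '\n' | shuf -n "$1"
}

# Hypothetical helper: N random word types (each distinct word equally likely).
random_types() {
    # sort -u keeps one copy of each word before shuffling.
    tr -sc 'A-Za-z' '\n' | LC_ALL=C sort -u | shuf -n "$1"
}

# Inline sample text (an assumption, standing in for the NYT excerpt).
printf 'the cat and the hat and the bat\n' | random_tokens 3
printf 'the cat and the hat and the bat\n' | random_types 3
```

The output changes from run to run. For the exercises as stated, the argument would be 10 and the input would be the NYT file.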
cut exercises
How often is "that" used as a determiner (DT) "that man" versus a complementizer (IN) "I know that he is rich" versus a relative (WDT) "The class that I love"
Hint: With grep -P, you can use '\t' for a tab character
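One way the exercise might be approached, sketched under a loud assumption: the tagged data has one token per line, with the word in column 1 and its part-of-speech tag in column 2, separated by a tab. The slides do not show the file layout, so the sample data and the function name that_tags are hypothetical. The sketch builds a literal tab with printf and passes it to plain grep, rather than using the '\t' escape of grep -P from the hint, so it does not depend on PCRE support.

```shell
# A literal tab character, used inside the grep pattern.
tab=$(printf '\t')

# Hypothetical helper: tag counts for the word "that".
that_tags() {
    # Keep lines whose first column is exactly "that" (any case),
    # cut out the tag column, then sort and count the tags.
    grep -i "^that$tab" | cut -f 2 | LC_ALL=C sort | uniq -c
}

# Made-up tagged sample: word<TAB>tag, one token per line.
printf 'that\tDT\nman\tNN\nthat\tIN\nThat\tDT\nthat\tWDT\n' | that_tags
```

cut -f 2 selects the second tab-separated field; tab is cut's default delimiter, so no -d option is needed here.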