A Short Introduction To Unix For Bioinformatics
@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS305:1:1:1:201#0/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS305:1:1:1:201#0/1
PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB
@HWI-EAS305:1:1:1:70#0/1
ATGATAATAATACCTCTTGCAGTTTGCATCATGT
+HWI-EAS305:1:1:1:70#0/1
OYYYY[[Z[YZ[Y[[WYYP[YYYWZTZ[ZYBBBB
@HWI-EAS305:1:1:1:983#0/1
ACCCAATACCGGTACAGGAATTGCAGCAATCAAA
+HWI-EAS305:1:1:1:983#0/1
OYYXVVYYYYUVUYXVRSUYYUTUTPVYYYYVVT
@HWI-EAS305:1:1:1:1671#0/1
AATTACACAACAAAAGGAGATCAAAGGGATACAA
+HWI-EAS305:1:1:1:1671#0/1
OY[[[[YXXX[[[ZXUWXZZ[Y[[VTTUU[[[YX
@HWI-EAS305:1:1:1:1699#0/1
GTTGGCTCTACGATCACGTTGCTCACCATGTGGG
+HWI-EAS305:1:1:1:1699#0/1
JSUSSUWTWWQUUUVVUUUTTTUTUWBBBBBBBB
@HWI-EAS305:1:1:1:1616#0/1
ATTGGCGACGATATTCAGGTCCATGTTTCTTGCG
+HWI-EAS305:1:1:1:1616#0/1
NYYYVRVVVSVYYYYWUNNPWVYBBBBBBBBBBB
@HWI-EAS305:1:1:1:1755#0/1
CAGTCACGAATCGGTGCGTCTTTCACCTGACACA
+HWI-EAS305:1:1:1:1755#0/1
PWVWWVTKTWWWPMTRWQWUWWUUUSSODORSUU
@HWI-EAS305:1:1:1:1046#0/1
CCCTGACAGGTAACAGGAGGTGGCTGGGGCTGAG
+HWI-EAS305:1:1:1:1046#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

[browns02@sanger]$ more s_1_sequence.txt
[browns02@sanger]$ wc s_1_sequence.txt
32806276 32806276 1062955212 s_1_sequence.txt
[browns02@sanger]$ grep "@" s_1_sequence.txt | wc
8201569 8201569 244422691
Unix Advantages
It is very popular, so it is easy to find information and get help
    pick up books at the local bookstore (or street vendor)
    plenty of helpful websites
    USENET discussions and e-mail lists
    most Comp. Sci. students know Unix
Most new bioinformatics software is created for Unix - it's easy for the programmers
There are many different versions of Unix with subtle (or not so subtle) differences
Free Software
Linux operating system, MySQL database
Perl - programming language
Blast and Fasta - similarity search
Clustal - multiple alignment
Phylip - phylogenetics
Phred/Phrap/Consed - sequence assembly and SNP detection
EMBOSS - a complete sequence analysis package created by the EMBL (like GCG)
Simple Programs
You can use the Unix shell to run programs right from the command line, or save the commands as shell scripts. Simple loops can run a program (such as Blast or FASTA) on many sequence files. Then you can check the output files for specific results, and use if statements to sort or take other actions. More about this next week.
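The loop-and-test idea can be sketched in a few lines. This is only an illustration: the FASTA file names and the with_hits variable are made up, and grep -c stands in for a real analysis program such as Blast.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Small, made-up FASTA files standing in for real sequence data.
printf '>seq1\nACGT\n>seq2\nGGCC\n' > gene_a.fasta
printf '>seq3\nTTAA\n' > gene_b.fasta
: > empty.fasta

with_hits=""
for f in *.fasta; do
    # grep -c counts FASTA header lines; it exits non-zero when the
    # count is 0, so "|| true" keeps the script going in that case.
    n=$(grep -c '^>' "$f" || true)
    if [ "$n" -gt 0 ]; then
        echo "$f: $n sequences"
        with_hits="$with_hits $f"
    else
        echo "$f: no sequences, skipping"
    fi
done
```

Saved in a file, the same lines become a shell script that can be rerun on a new set of files.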
Unix Commands
Unix commands are short and cryptic like vi or rm.
Computer geeks like it that way; you will get used to it.
Every command has a host of modifiers (options), which are generally single letters preceded by a hyphen: ls -l or cp -R
Capital letters have different functions than small letters, often completely unrelated. A command also generally requires an argument, meaning some file on which it will act:
cat -n mygene.seq
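A small sketch of how an option changes what a command does, and of how upper and lower case options differ. The file contents and the sub directory are invented for the demonstration; -n, -r and -R are standard options of cat and ls.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A made-up two-line sequence file to act on.
printf 'ATGC\nGGTA\n' > mygene.seq
mkdir sub
touch sub/inner.txt

# Same command, same argument: -n adds line numbers to the output.
plain=$(cat mygene.seq)
numbered=$(cat -n mygene.seq)
echo "$numbered"

# Lower case -r lists in reverse order; upper case -R descends into
# sub-directories. The two options are unrelated.
reverse=$(ls -r)
recursive=$(ls -R)
echo "$recursive"
```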
Wildcards
You can substitute the * as a wildcard symbol for any number of characters in any filename. If you type just * after a command, it stands for all files in the current directory:
lpr * will print all files
You can mix the * with other characters to form a search pattern:
ls a*.txt will list all files that start with a and end in .txt
A cp command given the pattern draft*.doc will copy draft1.doc, draft2.doc, draftb.doc, etc.
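A quick way to see what a pattern matches is to try it in a scratch directory. In this sketch the file names and the backup directory are made up; the point is that the shell expands the pattern into a list of matching names before the command runs.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty files with invented names, just to have something to match.
touch apple.txt avocado.txt banana.txt draft1.doc draft2.doc draftb.doc notes.doc
mkdir backup

# echo shows exactly what the shell turns the pattern into.
a_txt=$(echo a*.txt)
echo "$a_txt"

# draft*.doc matches the three drafts but not notes.doc.
cp draft*.doc backup/
ls backup
```

Running echo with a pattern before using it with cp or rm is a cheap way to avoid surprises.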
Typing Mistakes
Unix is remarkably unforgiving of typing mistakes
You can do a lot with just a few keystrokes, but it can be hard or impossible to undo
Control Characters
You type Control characters by holding down the control key while also pressing the specified character. While you are typing a command:
ctrl-W erases the previous word
ctrl-U erases the whole command line
There is a rudimentary Help system which consists of a set of "manual pages" for every Unix command. The man pages tell you which options a particular command can take, and how each option modifies the behavior of the command. Type man and the name of a command to read the manual page for that command.
ls(1)

NAME
    ls - Lists and generates statistics for files

SYNOPSIS
    ls [-aAbcCdfFgilLmnopqrRstux1] [file...|directory...]

STANDARDS
    Interfaces documented on this reference page conform to industry standards as follows:
    ls: XPG4, XPG4-UNIX
    Refer to the standards(5) reference page for more information about industry standards and associated tags.

OPTIONS
    -a  Lists all entries in the directory, including the entries that begin with a . (dot). Entries that begin with a . are not displayed unless you refer to them specifically, or you specify the -a option.
    -A  [Compaq] Lists all entries, except . (dot) and .. (dot-dot). If you issue the ls command as the superuser, it behaves as if you specified this option.
    -b  [Compaq] Displays nonprintable characters in octal notation.
    -c  Uses the time of last inode modification (file created, mode changed, and so on) for sorting when used with the -t option. Displays the time of last inode modification (instead of the time at which the file's contents were last modified) when used with the -l option. This option has effect only when used with either -t or -l or both.
You can change your password with the passwd command.
You can create a .login file in your home directory that executes any set of Unix commands every time that you log in.
Unix Filenames
Unix is cAsE sEnsItiVe
UNIX filenames should contain only letters, numbers, and the _ (underscore), . (dot), and - (dash) characters.
Unix does not allow two files to exist in the same directory with the same name.
Whenever a file is about to be created or copied into a directory where another file has that exact same name, the new file will overwrite (and delete) the older file. Unix alerts you only when the command is run with the -i (interactive) option, which some systems turn on by default; otherwise the overwrite happens silently, and even when there is a warning it is easy to ignore.
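The overwrite behaviour is easy to demonstrate safely in a scratch directory. The file names below are invented; -i is the standard interactive option of cp. The answer n is piped in here only so that the sketch runs unattended.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "old contents" > results.txt
echo "new contents" > update.txt

# Plain cp replaces results.txt without asking.
cp update.txt results.txt
after_plain=$(cat results.txt)
echo "$after_plain"

# With -i, cp asks before overwriting; answering n keeps the existing file.
# (Some versions of cp report a non-zero status when the copy is declined.)
echo "precious" > results.txt
echo n | cp -i update.txt results.txt || true
after_interactive=$(cat results.txt)
echo "$after_interactive"
```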
Filename Extensions
Most UNIX filenames start with a lower case letter and end with a dot followed by one, two, or three letters: myfile.txt
However, this is just a common convention and is not required. It is also possible to have additional dots in the filename.
The part of the name following the dot is called the extension. The extension is often used to designate the type of file.
Unix does not require these extensions (unlike Windows), but it is a sensible idea and one that you should follow
Directories contain files, executable programs, and sub-directories.
Understanding how to use directories is crucial to manipulating your files on a Unix system.
All of these commands can be modified with many options. Learn to use Unix man pages for more information.
Navigation
pwd (print working directory) shows the name and location of the directory where you are currently working:
> pwd
/u/browns02
This is a pathname; the slashes indicate sub-directories.
The initial slash is the root of the whole filesystem.
ls (list) gives you a list of the files in the current directory:
> ls
assembin4.fasta Misc test2.txt bin temp testfile
Use the ls -l (long) option to get more information about each file
> ls -l
total 1768
drwxr-x--- 2 browns02 users   8192 Aug 28 18:26 Opioid
-rw-r----- 1 browns02 users   6205 May 30  2000 af124329.gb_in2
-rw-r----- 1 browns02 users 131944 May 31  2000 af151074.fasta
Sub-directories
cd (change directory) moves you to another directory
> cd Misc
> pwd
/u/browns02/Misc
mkdir (make directory) creates a new sub-directory inside of the current directory
> ls
assembler phrap space
> mkdir subdir
> ls
assembler phrap space subdir
rmdir (remove directory) deletes a subdirectory, but the sub-directory must be empty
> rmdir subdir
Shortcuts
There are some important shortcuts in Unix for specifying directories:
. (dot) means "the current directory"
.. (dot-dot) means "the parent directory" - the directory one level above the current directory, so cd .. will move you up one level
~ (tilde) means your Home directory, so cd ~ will move you back to your Home.
Just typing a plain cd will also bring you back to your home directory.
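The navigation commands and shortcuts can be tried together in a scratch directory. This is a sketch: Misc mirrors the sub-directory used in the slides, while note.txt and the variable names are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
start=$(pwd)

mkdir Misc            # create a sub-directory
cd Misc               # move into it
in_misc=$(pwd)
cd ..                 # .. is the parent directory, so this moves back up
back=$(pwd)
echo "$in_misc"
echo "$back"

# . is the current directory, so ./Misc and Misc name the same place.
touch ./Misc/note.txt

# rmdir refuses to delete a directory that still contains files.
if rmdir Misc 2>/dev/null; then
    removed_while_full=yes
else
    removed_while_full=no
fi

rm Misc/note.txt      # empty the directory first
rmdir Misc            # now it can be removed

# ~ expands to the home directory; the subshell leaves us where we are.
home_dir=$(cd ~ && pwd)
echo "$home_dir"
```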
Write and delete privileges are effectively the same on a Unix system, since write privilege allows someone to overwrite a file with a different one. (Strictly, removing a file from a directory is governed by write permission on that directory.)
The username of the owner is shown in the third column. (The owner of the files listed above is browns02.) The owner belongs to the group users.
The access rights for these files are shown in the first column. This column consists of 10 characters known as the attributes of the file: r, w, x, and -
r indicates read permission
w indicates write (and delete) permission
x indicates execute (run) permission
- indicates no permission for that operation
> ls -l
drwxr-x--- 2 browns02 users   8192 Aug 28 18:26 Opioid
-rw-r----- 1 browns02 users   6205 May 30  2000 af124329.gb_in2
-rw-r----- 1 browns02 users 131944 May 31  2000 af151074.fasta
The first character in the attribute string indicates if a file is a directory (d) or a regular file (-).
The next 3 characters (rwx) give the file permissions for the owner of the file.
The middle 3 characters give the permissions for other members of the owner's group.
The last 3 characters give the permissions for everyone else (others).
The default protections assigned to new files on our system are: -rw-r----- (owner = read and write, group = read, others = nothing)
Change Protections
Only the owner of a file can change its protections.
To change the protections on a file, use the chmod (change mode) command.
[Beware, this is a confusing command.]
First you have to decide for whom you will change the access permissions:
    the file owner (u)
    the members of your group (g)
    others (o) (i.e. anyone with an RCR account)
Next you have to decide if you are adding (+), removing (-), or setting (=) permissions.
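Putting the two decisions together gives chmod arguments such as g+w or o-r. The sketch below works on an empty, made-up file (mygene.seq) and reads the attribute string back with ls -l after each change; cut -c1-10 keeps just the 10 attribute characters.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch mygene.seq

# Set (=) the permissions outright: owner rw, group r, others nothing.
chmod u=rw,g=r,o= mygene.seq
start_mode=$(ls -l mygene.seq | cut -c1-10)
echo "$start_mode"

# Add (+) write permission for the group and read permission for others.
chmod g+w mygene.seq
chmod o+r mygene.seq
opened_mode=$(ls -l mygene.seq | cut -c1-10)
echo "$opened_mode"

# Remove (-) read permission from others again.
chmod o-r mygene.seq
final_mode=$(ls -l mygene.seq | cut -c1-10)
echo "$final_mode"
```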
more
Use the command more to view the contents of a file one screen at a time:
> more t27054_cel.pep
!!AA_SEQUENCE 1.0
P1;T27054 - hypothetical protein Y49E10.20 - Caenorhabditis elegans
Length: 534  May 30, 2000 13:49  Type: P  Check: 1278  ..
  1 MLKKAPCLFG SAIILGLLLA AAGVLLLIGI PIDRIVNRQV IDQDFLGYTR
 51 DENGTEVPNA MTKSWLKPLY AMQLNIWMFN VTNVDGILKR HEKPNLHEIG
101 PFVFDEVQEK VYHRFADNDT RVFYKNQKLY HFNKNASCPT CHLDMKVTIP
t27054_cel.pep (87%)
More sophisticated options for viewing text files are available in a text editor (next week).
mv allows you to move files to other directories, but it is also used to rename files.
Filename and directory syntax for mv is exactly the same as for the cp command.
mv filename.ext subdir/newfilename.ext
NOTE: When you use mv to move a file into another directory, the file no longer exists in its original location.
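Both uses of mv, renaming and moving, can be seen in a short sketch. The names follow the mv example above; renamed.ext and the file contents are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "ACGT" > filename.ext
mkdir subdir

# Rename in place: same directory, new name.
mv filename.ext renamed.ext

# Move into subdir and rename in one step.
mv renamed.ext subdir/newfilename.ext

# Only the final copy exists; the earlier names are gone.
ls
ls subdir
```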
Delete
Use the command rm (remove) to delete files. There is no way to undo this command!!!
We have set the server to ask if you really want to remove each file before it is deleted. You must answer y or else the file is not deleted.
> ls
af151074.gb_pr5 test.seq
> rm test.seq
rm: remove test.seq? y
> ls
af151074.gb_pr5
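The confirmation prompt comes from the -i (interactive) option of rm. On a system where it is not switched on by default, it can be given explicitly, as in this sketch. The answers are piped in only so the example runs unattended; normally you type them.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch af151074.gb_pr5 test.seq

# Answering n: rm leaves the file alone.
echo n | rm -i test.seq
if [ -e test.seq ]; then kept_after_no=yes; else kept_after_no=no; fi

# Answering y: the file is deleted for good.
echo y | rm -i test.seq
remaining=$(ls)
echo "$remaining"
```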
FTP is Simple
File Transfer Protocol (FTP) is a standard supported by virtually every networked computer. It is the best way to move lots of data to and from remote machines:
    put raw data onto the server for analysis
    get results back to the desktop for use in papers and grants
FTP Login
When you open an FTP program, you connect to sanger just as you would with a terminal.
We now use sFTP (secure FTP).
Your username and password are the same.
You will automatically end up in your home directory. Put files from your PC to the server, Get files from the server to your desktop machine.
[browns02@sanger]$ more s_1_sequence.txt
[browns02@sanger]$ wc s_1_sequence.txt
32806276 32806276 1062955212 s_1_sequence.txt
[browns02@sanger]$ grep "@" s_1_sequence.txt | wc
8201569 8201569 244422691
[browns02@sanger]$ grep -A 1 "@" s_1_sequence.txt > s1.seq
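In the session above, the grep count (8201569) is exactly a quarter of the line count (32806276), as expected for a file with four lines per read. That will not hold for every FASTQ file, because @ is also a legal quality character. The sketch below uses a tiny made-up file (sample.fastq) whose second quality line starts with @, and shows a position-based alternative using awk. GNU grep -A also writes a -- separator line between groups of output, which the awk version avoids.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two invented reads; note the quality line "@III" in the second record.
cat > sample.fastq <<'EOF'
@read1
ACGT
+read1
IIII
@read2
GGCC
+read2
@III
EOF

# Count reads by position: every record is exactly four lines.
lines=$(wc -l < sample.fastq | tr -d ' ')
reads=$((lines / 4))
echo "reads by line count: $reads"

# Counting lines that contain @ also picks up the quality line.
by_grep=$(grep -c "@" sample.fastq)
echo "lines containing @: $by_grep"

# Keep lines 1 and 2 of each four-line record (header and sequence).
awk 'NR % 4 == 1 || NR % 4 == 2' sample.fastq > sample.seq
cat sample.seq
```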