Röbbe Wünschiers

Computational Biology –
Unix/Linux, Data Processing and Programming

Springer-Verlag Berlin Heidelberg GmbH
Dr. Röbbe Wünschiers
University of Cologne
Institute for Genetics
Weyertal 121
50931 Köln
Germany
This work is subject to copyright. All rights reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this
publication or parts thereof is permitted only under the provisions of the German Copyright Law of
September 9, 1965, in its current version, and permission for use must always be obtained from
Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.
springeronline.com
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant pro-
tective laws and regulations and therefore free for general use.
A shift in culture
Only a decade ago, the first thing a molecular biologist would have had to
learn when he or she started the lab work was how to handle pipettes, extract
DNA, use enzymes and clone a gene. Now, the first thing that he or she should
learn is how to handle databases and to extract all the information that is
already known about the gene that he or she wants to study. In all likelihood,
he or she will find that the gene has already been sequenced from several
organisms, that it was recovered in a variety of EST projects, that expression
data are available from microarray and SAGE studies, that it was included
in linkage studies, that proteomics data are rapidly accumulating, that lists
of interacting proteins are being compiled, that domain structure data are
available and that it is part of a network of genetic interactions which is
intensively modelled. He or she will discover that all this information resides
in many different databases with different data formats and with different
levels of analyses and linking. Starting to work on this gene will make sense
only if all this information is put together in a project-specific manner and
set into the context of what is known about related genes and processes. At
this point he or she may decide to walk up to the bioinformatics group in
house and ask for help with arranging the data in a useful manner. This will
then turn into the first major frustration in his or her career, since the last
thing a scientific bioinformatics group wants to do is to provide a service for
data retrieval and management.
Molecular biology is currently going through a dramatic cultural shift. The
daily business of pipetting and gel running increasingly has to be comple-
mented with data compiling and processing. Large lists of data are produced
by sequencers, microarray experiments or real-time PCR machines every day.
Working with data lists has become as important as extracting DNA. Every
bench scientist needs proficiency in computing; discoveries are made both at
the bench and on the screen. It would be completely wrong to think that
computers in molecular biology are the business of bioinformaticians only.
Bioinformatics has become a scientific discipline of its own and should not
be considered to be a service provider. The day-to-day computing will always
have to be done by the experimentalist himself or herself.
Of course, there are now also a lot of helpful and fancy program packages
for the bench scientist; but these will only perform routine tasks and all too
often they are only poorly compatible. A scientist needs the freedom to develop
his or her own ideas and to link things that have previously not been linked.
Being able to go back to the basics of computing and programming is therefore
a vital skill for the experimentalist, as important as making buffers and setting
up enzyme reactions. It allows him or her to handle and analyze the data in
exactly the way it is required for the project and to pursue new avenues of
research, rather than treading well-worn paths.
Unix is the key to basic computing. If one is used to Windows or Mac
operating systems, this might at first sound like going back into the stone
age; but the dramatic recent shift of at least the Mac operating system to
a Unix base should teach us otherwise. Unix is here to stay and it allows
the largest flexibility for bioinformatics applications. Those who have learned
Unix will soon discover the myriad of little “proggies” that are available from
colleagues all over the world and that can make life much easier in the lab.
This book gives exactly the sort of introduction into Unix, Unix-based
operating systems and programming languages that will be a key competence
for experimentally working molecular biologists and that will make all the
difference for the successful projects of the future. It has been written by a
bench scientist, specifically with the needs of molecular biologists in mind. It
can be used either for self-teaching or in practical courses. Every group leader
should hand over this book to new students in the lab, together with their
first set of pipettes.
Welcome on board!
With this book I would like to invite you, the scientist, to a journey through
terminals and program codes. You are welcome to put aside your pipette,
culture flask or rubber boots for a while, make yourself comfortable in front of
a computer (do not forget your favourite hot alcohol-free drink) and learn some
unixing and programming. Why? Because we are living in the information age
and there is a huge amount of biological knowledge and databases out there.
They contain information on almost everything: genes and genomes, rRNAs,
enzymes, protein structures, DNA-microarray experiments, single organisms,
ecological data, the tree of life and endless more. Furthermore, nowadays many
research apparatuses are connected to computers. Thus, you have electronic
access to your data. However, in order to cope with all this information you
need some tools. This book will provide you with the skills to use these tools
and to develop your own tools, i.e. it will introduce Unix and its derivatives
(Linux, Mac OS X, CygWin, etc.) and programming (shell programming, awk,
perl). These tools will make you independent of the way in which other people
make you process your data – in the form of application software. What you
want is open functionality. You want to decide how to process (e.g. analyze,
format, save, correlate) data and you want it now – not waiting for the lab
programmer to deal with your request; and you know best – you understand
your data and your demands. This is what open functionality stands for, and
both Linux and programming languages can provide it to you.
I started programming on a Casio PB-100 hand-held built in 1983. It can
store 10 small Basic programs. The accompanying book was entitled “Learn as
you go” and, indeed, in my opinion this is the best way to learn programming.
My first contact with Unix was triggered by the need to copy data files from a
Unix-driven Bruker EPR-Spectrometer onto a floppy disk. The real challenge
started when I tried to import the files to a data-plotting program on the PC.
While the first problem could be solved by finding the right page in a Unix
manual, the latter required programming skills – Q-Basic at that time. This
problem was minor compared to the trouble one encounters today. A common
problem is to feed one program with the output of another program: you might
have to change lines to columns, commas to dots, tabulators to semicolons,
uppercase to lowercase, DNA to RNA, FASTA to GenBank format and so
forth. Then there is that huge amount of information out there in the web,
which you might need to bring into shape for your own analysis.
You and This Book – This book is written for the total beginner. You need not
even know what a computer is, though you should have access to one and
find the power switch. The book is the result of a) the way I learned to
work with Unix, its derivatives and its numerous tools and b) a lecture which
I started at the Institute for Genetics at the University of Cologne, Germany.
Most programming examples are taken from biology; however, you need not
be a biologist. Except for two or three examples, no biological knowledge is
necessary. I have tried to illustrate almost everything practically with so-called
terminals and examples. You should run these examples. Each chapter closes
with some exercises. Brief solutions can be found at the end of the book.
Why Linux? – This book is not limited to Linux! All examples are valid for
Unix or any Unix derivative like Mac OS X, Knoppix or the free Windows-
based CygWin package, too. I chose Linux because it is open source software:
you need not invest money except for the book itself. Furthermore, Linux
provides all the great tools Unix provides. With Linux (as with all other
Unix derivatives) you are close to your data. Via the command line you have
immediate access to your files and can use either publicly available tools or
tools of your own design to process them. With the aid of pipes you can construct
your own data-processing pipeline. It is great.
Why awk and perl? – awk is a great language for both learning programming
and treating large text-based data files (as opposed to binary files). Ninety-nine
percent of the time you will work with text-based files, be it data tables, genomes or species lists.
Apart from being simple to learn and having a clear syntax, awk provides you
with the possibility to construct your own commands. Thus, the language can
grow with you as you grow with the language. I know bioinformatics profes-
sionals who focus entirely on awk. perl is much more powerful but also more
unclear in its syntax (or flexible, to put it positively), but, since awk was one
basis for developing perl, it is only a small step to go once you have learned
awk – but a giant leap for your possibilities. You should take this step. By the
way, both awk and perl run on all common operating systems.
Acknowledgements – Special thanks to Kristina Auerswald, Till Bayer, Bene-
dikt Bosbach and Chris Voolstra for proofreading, and all the other students
for encouraging me to bring these lines together.
Hürth/Germany,
January 2004 Röbbe Wünschiers
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 What Is Linux? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 What Is Shell Programming? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 What Is sed? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 What Is awk? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 What Is perl? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.8 Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Unix/Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 What Is a Computer? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Some History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Versions of Unix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 The Rise of Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Why a Penguin? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Linux Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 X-Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 How Does It Work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Unix/Linux Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 What Is the Difference Between Unix and Linux? . . . . . . . . . . . 17
2.6 What Is the Difference Between Unix/Linux and Windows? . . 17
2.7 What Is the Difference Between Unix/Linux and Mac OS X? . 18
2.8 One Computer, Two Operating Systems . . . . . . . . . . . . . . . . . . . 18
2.8.1 VMware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.8.2 CygWin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.8.3 Wine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.8.4 Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.9 Knoppix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.10 Software Running Under Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.10.1 Biosciences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.10.2 Office and Co. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
8 Shell Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.1 Script Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.2 Modifying the Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.3 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.4 Input and Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
10 Sed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
10.1 When to Use sed? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
10.2 Getting Started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
10.3 How sed Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
10.3.1 Pattern Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
10.3.2 Hold Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
10.4 sed Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
10.4.1 Addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
10.4.2 sed and Regular Expressions . . . . . . . . . . . . . . . . . . . . . . 147
10.5 Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
10.5.1 Substitutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
10.5.2 Transliterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
10.5.3 Deletions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
10.5.4 Insertions and Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
10.5.5 sed Script Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
10.5.6 Printing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
10.5.7 Reading and Writing Files . . . . . . . . . . . . . . . . . . . . . . . . 156
10.5.8 Advanced sed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
10.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
10.6.1 Gene Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
10.6.2 File Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
10.6.3 Reversing Line Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Part IV Programming
11 Awk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
11.1 Getting Started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
11.2 awk’s Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
11.3 Example File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
11.4 Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
11.4.1 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
11.4.2 Pattern-Matching Expressions . . . . . . . . . . . . . . . . . . . . . 168
11.4.3 Relational Character Expressions . . . . . . . . . . . . . . . . . . 169
11.4.4 Relational Number Expressions . . . . . . . . . . . . . . . . . . . . 171
11.4.5 Mixing and Conversion of Numbers and Characters . . 171
11.4.6 Ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
11.4.7 BEGIN and END . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
11.5 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
11.5.1 Assignment Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
11.5.2 Increment and Decrement . . . . . . . . . . . . . . . . . . . . . . . . . 176
11.5.3 Predefined Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
11.5.4 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
11.5.5 Shell Versus awk Variables . . . . . . . . . . . . . . . . . . . . . . . . 185
12 Perl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
12.1 Intention of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
12.2 Running perl Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
12.3 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
12.3.1 Scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
12.3.2 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
12.3.3 Hashes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.3.4 Built-in Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.4 Decisions – Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.4.1 if...elsif...else . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.4.2 unless, die and warn . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.4.3 while... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
12.4.4 do...while... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
12.4.5 until... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
12.4.6 do...until... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
12.4.7 for... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
12.4.8 foreach... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
12.4.9 Controlling Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
12.5 Data Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
12.5.1 Command Line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
12.5.2 Internal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
A Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
A.1 Keyboard Shortcuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
A.2 Boot Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
A.3 Optimizing the Bash Shell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
A.3.1 Startup Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
A.3.2 A Helpful Shell Prompt . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
A.4 Text File Conversion (Unix ↔ DOS) . . . . . . . . . . . . . . . . . . . . . . 267
A.5 Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
A.6 Mounting Filesystems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
A.6.1 Floppy Disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
A.6.2 CD-ROM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
A.6.3 PCMCIA and Compact Flash . . . . . . . . . . . . . . . . . . . . . 270
A.7 Nucleotide and Protein Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
A.8 Special Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Part I
1.1 Information
This book aims at the total beginner. However, if you know something about
computers but not about programming, the book will still be useful for you.
After introducing the basics of how to work in the Unix/Linux environment,
some great tools will be presented. Among these are the stream editor sed
and the script-oriented programming languages awk and perl. These utilities
are extremely helpful when it comes to formatting and analyzing data files.
After you have worked through all the chapters, you can use this book as a
reference. The learning approach is thoroughly practical. Thus, you
are invited to run all examples, printed in so-called Terminals, on your own!
If you face any problems: contact me! Of course, I cannot help you if your
non-unix-like-operating-system driven computer crashes continuously. How-
ever, if things connected to this book confuse you – or you even find errors –
please let me know:
Further information about this book, including lists with internet links and
known errors, can be found at my homepage.
Homepage: www.uni-koeln.de/~aei53
You are very much welcome to supply me with good ideas for examples!
Figure 1.1 shows a sketch for an example of what sed can do. Here, the stop
codon of a DNA sequence is replaced by the text “!STOP!”. sed is well suited
to perform small formatting tasks like converting RNA to DNA, commas to
points, tabs to semicolons and the like.
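On the command line, such a substitution might look like the following sketch
(the sequence and the stop codon TAA are made up for illustration; sed itself
is introduced in Chapter 10 on page 141):

$ echo "ATGGCCTAA" | sed 's/TAA$/!STOP!/'
ATGGCC!STOP!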
awk was developed by Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan in 1977. This is where the name
comes from. awk is really a great tool when it comes to analyzing the content
of data files. With awk you can perform calculations, make decisions, and read and
write multiple files. What is best is that awk can be extended with functions of
your own design. A typical task would be to fuse the content of files sharing
one common field. Another typical task would be to extract data matching
certain criteria as shown above. awk forms the kernel of this book. After you
have finished the chapter on awk you should be able to a) program basically
anything you need and b) learn any other programming language.
The example shown in Fig. 1.2 shows one basic function of awk. All enzyme
names in the file enzymes.file are printed, if the corresponding Km value (do
you remember enzyme kinetics and Michaelis-Menten?) is smaller than 0.4.
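A minimal awk sketch of this filter could look as follows, assuming that
enzymes.file holds the enzyme name in the first column and the Km value in
the second (awk is introduced in Chapter 11 on page 163):

$ awk '$2 < 0.4 {print $1}' enzymes.file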
1.7 Prerequisites
In order to perform the exercises you need to have access to a computer run-
ning either Linux, Unix or Mac OS X (the newest Apple operating system)
or the free Windows Unix emulator CygWin (see below). As you will learn
soon, these systems are very similar. Thus, all the things we are going to learn
will work on all Linux, Unix and Mac OS X computers. On a normal installa-
tion all required programs should be installed. Otherwise, contact the system
administrator. My personal recommendation is to start with Knoppix Linux
(see Sect. 2.9 on page 20). Knoppix runs from a CD-ROM and it requires
no installation on the hard disk drive. Neither your operating system nor the
data on your computer will be touched.
Alternatively, you can install the free Cygwin Unix emulator on a computer
running the Microsoft Windows operating system (from Win95 upwards, ex-
cluding WinCE) (see Sect. 2.8.2 on page 19) or the commercial VMware pack-
age in order to run two operating systems on one computer (see Sect. 2.8.1 on
page 19).
1.8 Conventions
What you see on your computer screen is written in typewriter style and
boxed. I will refer to this as the Terminal. The data you have to enter are
given behind the $ character. Key labels are written in a box. For example,
the key labelled “Enter” would be written as Enter. Commands that appear
in the text are written in typewriter, too. When necessary, space characters
are symbolized by a “␣”. Thus, “␣␣␣” means that you have to type three
consecutive spaces. In the following example you would type date as input
and get the current date as output.
Terminal 1: Date
1 $ date
2 Thu Feb 13 18:53:26 CET 2003
3 $
In most cases you will find some text behind the terminal which describes
the terminal content: in Terminal 1, line 1, we check for the current date and
time.
Boxes labelled “Program” contain script files or programs. These have to be
saved in a file as indicated in the first or second program line: # save as
hello.sh. You will find the program under the same name on the accompa-
nying website. As terminals, programs are numbered.
Program 1: Test
1 #!/bin/sh
2 # save as hello.sh
3 # This is a comment
4 echo "Hello World"
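Running script files is treated in detail in Chapter 8 on page 97. If you want
to try Program 1 right away, one simple way is to hand the file to the shell
explicitly:

$ sh hello.sh
Hello World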
At the end of most chapters you will find exercises. These are numbered,
too. The solutions can be found in the Solutions chapter on page 273.
Part II
This chapter’s intention is to give you an idea about what operating systems
in general, and Unix/Linux in particular, are. It is not absolutely necessary
to know all this. However, since you are going to work with these operating
systems you are taking part in their history! I feel that you should have heard
of (and learned to appreciate) the outline of this history. Furthermore, it gives
you some background on the system with which we are going to work; but if you are
hungry for practice you might prefer to jump immediately to Chapter 3 on
page 27.
Fig. 2.1. A simple computer based on Zilog’s Z80 processor (1). The system is
equipped with 4 kilobytes of RAM (random access memory) (2). Data input and output
are realized with electric switches and lamps, respectively (3, 4). All components
are connected via the system bus (5). The speed of the processor can be adjusted
between single step and 1 MHz. The whole system was built by the author in the
late 1980s
In this environment, Ken Thompson and Dennis Ritchie, who worked for
the Bell Laboratories of AT&T, developed the first version of Unix in 1969
(Fig. 2.2 on the next page). At that time, Unix drew heavily on ideas from the
research operating system Multics (multiplexed information and computing
service). Multics was an interactive operating system designed to run on the
GE645 computer built by General Electric. Out of Multics grew Unix,
which ran on the PDP-7 computer built by Digital Equipment Corporation
(later Compaq). The new operating system had multitasking abilities and
could be used by two persons at the same time. Some people called it Unics
(uniplexed information and computing system). However, the operating system
was limited to the PDP-7 hardware and required the PDP-7 assembler (the
program that translates assembly source into processor code). It was
Dennis Ritchie from Bell Laboratories who rewrote Unix in the programming
language C. The advantage of C is that it runs on many different hardware
systems – thus, Unix was now portable to hardware other than PDP-7 ma-
chines. Unix took off when Dennis Ritchie and Ken Thompson published a
paper about Unix in July 1974 [10]. In the introduction to their paper they
wrote: “Perhaps the most important achievement of UNIX is to demonstrate
that a powerful operating system for interactive use need not be expensive
either in equipment or in human effort: UNIX can run on hardware costing as
little as $40,000, and less than two man-years were spent on the main system
software.”
With time, Unix became popular among AT&T companies, including Bell
Laboratories and academic institutions like the University of California in
Berkeley. From 1975 on, the source code of Unix was distributed for a small
fee. From that point on, Unix started to take off (Fig. 2.2 on the following
page).
While there were only about a dozen Unix installations worldwide in spring
1974, three dozen were registered in spring 1975, around 60 in autumn 1975
and 138 in autumn 1976. AT&T itself was not very interested in selling Unix.
Thus, from 1975 on, Unix’s development took place largely outside Bell Lab-
oratories, especially at the University of California in Berkeley. In contrast
to proprietary operating systems like Microsoft’s DOS, Unix was now de-
veloped and sold by more than 100 companies including Sun Microsystems
(Solaris), Hewlett Packard (HP/UX), IBM (AIX) and the Santa Cruz Op-
eration (SCO-Unix). Of course, this led to the availability of different and
partially incompatible Unix versions, which is a major problem for software
developers. However, two Unix variants came out on top of all distributions:
AT&T System V – Developed in 1976 by the Bell Laboratories, the AT&T
System V was distributed to many international universities. In 1989 AT&T
founded the Unix-System Laboratories (USL) for further development of the
Fig. 2.2. Evolution of operating systems. Note the recent hybridization of Unix and
Macintosh with Mac OS X in 2001. Unix-based operating systems are underlaid in
grey
source code of System V. In 1991, a cooperation between USL and the com-
pany Novell started, which led to UnixWare running on Intel platforms. In
1996, Novell sold its Unix department to the Santa Cruz Operation (SCO),
which is still developing Unix.
BSD v4 (Berkeley Software Distribution) – In parallel to AT&T, the Uni-
versity of California in Berkeley developed Unix up to version BSD v4. This
version ran on VAX computers. In 1991, the spin-off company Berke-
ley Software Design Inc. was founded and, from that year on, has been selling
BSD-Unix commercially. However, in parallel, there are still freeware versions
available, namely FreeBSD and NetBSD. Apple’s new operating system Mac
OS X is actually based on BSD (see Sect. 2.7 on page 18).
Thus, the presently available Unix versions are based on either AT&T System
V or BSD v4 (Fig. 2.2). In order to secure compatibility between the different
Unix versions, The Open Group was founded in 1996.
Linux is a very young operating system. Its first version was distributed by
Linus Torvalds in 1991. The following news group posting was the first official
announcement of the new operating system.
At that time the Finn Linus Torvalds was a student at the University
of Helsinki (Finland). Since he offered the software freely and with source
code via the internet, many programmers around the world had access to
it and added more components to it, like improved file organization, drivers
for different hardware components and tools like a DOS emulator. All these
enhancements were again made available for free, including the source code.
Thus, if there was any error with a software component it could be fixed by
experienced programmers. A very important part of that concept was estab-
lished already in 1984 by Richard Stallman with the GNU (GNU’s Not Unix)
and FSF (Free Software Foundation) projects. GNU programs are freely avail-
able and distributable under the GNU General Public License (GPL). In short, this
means that GNU software can be used, developed and redistributed for free
or commercially. However, the source code must always be a free part of the
distribution. Many Unix users used GNU programs as substitute for expen-
sive original versions. Popular examples are the text editor emacs, the GNU
C-compiler (gcc) and diverse utilities like grep, find and gawk. Almost all
open-source activities are now under the roof of the Open Source Initiative
(OSI) directed by Eric S. Raymond.
As a matter of fact, Linux was not developed out of the blue: it was inspired
by the operating system Minix and, from the beginning on, used GNU elements. Once the
GNU C-compiler was running under Linux, all other GNU utilities could be
compiled to run under Linux. Only the combination of the Linux kernel with
GNU components, the network software from BSD-Unix, the free X Window
System (see Sect. 2.3) from the MIT (Massachusetts Institute of Technol-
ogy) and its XFree86 port for Intel-powered PCs and many other programs,
turned Linux into a powerful operating system which could compete with
Unix.
You might already have come across the nice penguin with his large orange
feet and beak. The penguin is the mascot and symbol for Linux. Why is this?
Well, although once bitten by a penguin in an Australian zoo, Linus Torvalds
(the inventor of Linux) loves penguins [16]. Once the idea to use a penguin
as logo for Linux was born, Linus Torvalds screened a variety of suggestions
and finally chose the version from Larry Ewing, a graphics programmer and
assistant system administrator at the Institute for Scientific Computing at
Texas A&M University, USA. The penguin’s name is Tux, which is derived
from the dinner jacket called tuxedo (tux for short).
There is one major problem when many people around the world develop and
update an operating system: it is an awful task to collect all necessary and
up-to-date components from the internet. Thus, different companies appeared,
which took over the job and distributed complete sets of the Linux kernel with
software packages, drivers, documentation and more or less comfortable instal-
lation programs and software package managers. Currently, there are many
different Linux distributions on the market, the most famous among them
are: Red Hat, SuSE, Debian, Mandrake and Knoppix. The good thing with
Knoppix Linux is that you can run it from the CD-ROM. It does not require
any installation and does not touch your hard disk drive at all, although you
have access to it if you want. Therefore, it is well suited for playing around and
I highly recommend its use for all the exercises in this book (see Sect. 2.9 on
page 20).
Finally, there are some minimal distributions available on the net. This means
you can have Linux on two floppy disks or so and install a minimum system
on an old computer (i386 or i486).
2.3 X-Windows
We will work only with the command line. That looks a bit archaic, like in the
good old DOS times. Anyway, Linux also provides a graphical user interface
(GUI). In fact, it provides many GUIs and thus is much more powerful than
Microsoft Windows. Graphical user interfaces are systems that make com-
puting more ergonomic. Basic elements are windows, which are placed on the screen.
The graphical desktop can be split into several parts: the X-server, the X-
clients, the window manager and the desktop; but the Unix/Linux desktop is
something different from a Windows desktop.
The X-server is hosted by the computer (or graphical terminal station) you
are sitting in front of. An X-server is nothing more than the black and white
chess pattern you see briefly before the graphical background of your
window manager appears. The X-server handles the graphical presentation
and is responsible for the communication between the hardware (in partic-
ular the graphic card, which handles the screen) and the software (the X-
programs). If you ran only an X-server, you would not get far: it offers
neither menus, nor windows, nor any other features you need.
Here the X-client comes into play: the real work is done by X-clients.
These X-clients use libraries that communicate with the X-server and
contain the information on how to display graphics. If you see an X-terminal
(console) on your monitor, it is an X-client. The communication between X-
server and X-client works through the network. That is the reason why you
can start an X-client on any computer in the network and see it somewhere
else (this is the way the X-terminals work; you start the programs on a pow-
erful server and sit in front of a simple terminal). Even if you work on an
isolated computer and have no network card installed, X-client and X-server
communicate via the network. In that case, Unix/Linux simulates a network
(loopback). X-server and X-clients alone are not really comfortable to work
with. Useful functions like “Maximizing”, “Minimizing” and “Close Window”
are not included in the functionality of the X-server and X-clients but made
available by a window manager. There are several window managers avail-
able, like FVWM (www.fvwm.org), IceWM (www.icewm.org), Window Maker
(www.windowmaker.org), Sawfish, formerly Sawmill (sawmill.sourceforge.net)
and Metacity (www.gnome.org/softwaremap/projects/Metacity).
A long time ago, only these three components (X-server, X-client and win-
dow manager) existed. However, in the past years, an additional “thing” has
come into existence: the desktop. The desktop offers functionalities similar to
those we are used to from Windows, like putting program and file icons onto
the desktop, which can be started or opened by double-clicking, respectively.
However, with Unix/Linux, usually a single click is sufficient. Of course, icons
existed long before (e.g. after minimizing a window with FVWM), but the
functionality was not as great as with a desktop. KDE and Gnome are two
popular desktops running on Unix/Linux. They come with their own win-
dow managers, but both programs could be used with alternative window
managers as well.
Fig. 2.3. The kernel is the heart of Unix/Linux. It is wrapped around the hardware
and accessed via the shell
The hardware, i.e. the memory, disk drives, the screen and keyboard and
so on, is controlled by the kernel. The kernel is, as the name implies, the
heart of Unix/Linux. Only the kernel has direct control over the hardware.
Unix/Linux’s kernel is programmed in C. It is continuously improved; this
means errors are corrected and drivers for newly available hardware
components are added. Thus, when you buy a Unix/Linux version, the
kernel version is usually indicated. The newest kernel is not necessarily the
best: it might have new errors. For us, the users, it is very inconvenient to
communicate with the kernel directly. Therefore, the shell was developed. The shell
is an envelope around the kernel. It provides commands to work with files or
access floppy disk drives and such things. You are going to learn more about
the shell in Chapter 7 on page 81. Finally, there are the applications like
OpenOffice or ClustalX or awk or perl. These communicate with the kernel,
usually via the shell. You see, the Unix/Linux operating system is clearly
structured. This is part of Unix’s philosophy: “Keep it simple, general and
extensible”.
Unix-based operating systems run on many hardware platforms, whereas the desktop-
and server-oriented Windows operating systems run only on Intel and AMD
processors. Furthermore, Unix-based operating systems are much closer to the
data. Thus it is much easier to format and analyze files.
With a dual-boot installation, you have to quit
your Windows session and reboot into Linux. The problem is that you can-
not do both at the same time. Each time you switch back and forth between
Windows and Linux, you have to reboot again. This can quickly get tiresome.
Therefore, you might want to use two operating systems on one computer –
in parallel! There are several possibilities.
2.8.1 VMware
2.8.2 CygWin
2.8.3 Wine
2.8.4 Others
Of course, there are many other possibilities and you should find out for your-
self which solution is suitable for your setup. With Win4Lin you can run Win-
dows programs under Linux (www.netraverse.com). Another option might be
Plex86 (plex86.sourceforge.net). Plex86 is an extensible free PC virtualization
software program which will allow PC and workstation users to run multi-
ple operating systems concurrently on the same machine. Plex86 is able to
run several operating systems, including MSDOS, FreeDOS, Windows9x/NT,
Linux, FreeBSD and NetBSD. It will run as much of the operating system
and application software natively as possible, the rest being emulated by a
PC virtualization monitor. I am sure there is more software available and
even more to come. Check it out for yourself.
2.9 Knoppix
Knoppix is a free Linux distribution developed by Klaus Knopper, which
can be run from a CD-ROM. This means you put the CD-ROM into your
computer’s CD-ROM drive, switch on the computer and start working with
Linux. Knoppix has a collection of GNU/Linux software, automatic hard-
ware detection and support for many graphics cards, sound cards, SCSI and
USB devices and other peripherals. It is not necessary to install anything
on the hard disk drive. Due to on-the-fly decompression, the CD carries al-
most 2 GB of executable software. You can download or order Knoppix from
(www.knopper.net/knoppix ).
The minimum system requirements are an Intel-compatible CPU (486, Pen-
tium or later), 20 MB of RAM for the text-only mode or at least 96 MB for
the graphical mode with KDE. At least 128 MB of RAM is recommended if
you want to use OpenOffice. Knoppix can also use swap space on the hard disk
drive to substitute for missing RAM. However, I cannot recommend using this op-
tion, since it decreases the performance. Take a look at the Knoppix manual in
order to learn how to do this. Of course, you need to have a bootable CD-ROM
drive or a boot floppy if you have a standard CD-ROM drive (IDE/ATAPI or
SCSI). For the monitor, any standard SVGA-compatible graphic card will do.
If you are one of those freaks playing the newest computer games and always
buying the newest graphic card you might run into trouble; but I am sure that
you then own at least one out-dated computer you can use for Linux. As a
mouse, you can use a standard serial or PS/2 mouse or an IMPS/2-compatible
USB mouse.
Before you can start Knoppix, you need to change your computer’s BIOS (ba-
sic input/output system) settings to boot from the CD. When you start up
your computer you are normally asked whether you want to enter the system
setup (BIOS). Usually, one of the following keys or key combinations is re-
quired: Del, Esc, F2, Ctrl+Alt+Esc or Ctrl+Alt+S. When you succeed
in changing the settings you are done; but be careful not to change anything
else! A wrong setting might leave your computer unable to boot. However, if your computer does not
support the option to boot from the CD-ROM, or you are afraid of doing
something wrong, you have to use a boot disk. You can create this disk from
the image in the file /KNOPPIX/boot.img on the CD-ROM. Read the manual
on the Knoppix CD-ROM for more details on this issue. Once prepared, you
put the CD in the drive and power up the computer. After some messages
the system halts and you see the input prompt “boot:”. Now you can hit F2
and optimize Knoppix to your needs. For example, knoppix lang=de enables
the German keyboard layout.
2.10.1 Biosciences
There is a lot of software available for all fields of academics. Being a biol-
ogist myself, I will restrict myself to listing some important biological software
packages available for Linux. An updated list can be obtained from
bioinformatics.org/software.
Emboss – Emboss is the European Molecular Biology Open Software Suite.
Emboss is freely available and specially developed for the needs of molecular
biologists. Currently, Emboss provides a comprehensive set of over 100 se-
quence analysis programs. The whole range from DNA and protein sequence
editing, analysis and visualization is covered. Restriction analysis, primer de-
sign and phylogenetic analysis can all be performed with this software package.
Take a look at www.hgmp.mrc.ac.uk/Software/EMBOSS.
Staden Package – The Staden Package is a software package free for aca-
demics (charge for commercial users) including sequence assembly, trace view-
ing/editing and sequence analysis tools. It also includes a graphical user in-
terface to the Emboss suite. More information can be obtained at
www.mrc-lmb.cam.ac.uk/pubseq/staden_home.html
Blast – Blast (Basic Local Alignment Search Tool) is a set of similarity
search programs designed to explore online sequence databases. Alternatively,
you can set up your own local databases and query those. In Chapter 5 on
page 53 we will download, install and run this program.
ClustalW – ClustalW is probably the most famous multiple-sequence align-
ment program available. It is really powerful and can be fine-tuned by a num-
ber of program options. In Chapter 5 on page 53 we will download, install
and run this program.
In this chapter you will learn the basics in order to work on a Unix-based
computer: login, execute commands, logout. Everything you are going to do is
happening at the command-line level. This means the look and feel will be
like in the good old DOS times. It will look like stone-age computing. However,
you should remember that although a graphical interface is often nice and
comfortable, it consumes a lot of power and only keeps us from learning what
is really important. Furthermore, the Unix/Linux command line is extremely
powerful. You will soon get accustomed to it and never want to miss it again.
Let us face it...
3.1 Login
The process of making yourself known to the computer system and getting to
your Unix/Linux account is called logging in. There are two prerequisites: first
you need to connect to the computer, and second you must have an account
on that computer. That is like withdrawing money from a bank account. You
must identify yourself to the cash machine before you get any money and
you must, of course, have a bank account (with some money in it). There are
several ways to connect to a Unix-based computer.
In case you want to work with Apple’s Mac OS X operating system (see
Sect. 2.7 on page 18) or the Unix emulator CygWin, which runs in the Win-
dows environment (see Sect. 2.8.2 on page 19), then you need not care about
any password. When you start CygWin, you will directly end up at the com-
mand line of the bash shell. In Mac OS X you find the Terminal application
under “Applications → Utilities”. Again, starting the application will imme-
diately bring you to the command line. In any case you should look up your
username with the command whoami (just type whoami and then press Enter).
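For the example user of this chapter, the result would look like this:

$ whoami
Freddy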
You might have installed Linux on your computer or run it from a CD-ROM
on your computer (like the Knoppix distribution of Linux, see Sect. 2.9 on
page 20). In that case, you boot the computer and get directly to the login
screen. Depending on the system settings, this might be either a graphical
login or a command line login. In both cases you enter your username and
password (with Knoppix you do not need a username and password; you are
directly logged in as user knoppix ). With the graphical login you immediately
end up on a nice-looking graphical desktop – most probably KDE or Gnome.
In that case, you have to open a terminal window in order to follow the
examples given in this script. If the command line login appears after booting
the computer, you have to follow the instructions given in Terminal 2.
Fig. 3.1. The PuTTY configuration window. The most important information
needed is the IP address or host name of the computer you want to work on re-
motely
Normally, you will not log in as superuser or root. The root account belongs to the administrator who has
access to all system components and owns all rights. In Terminal 2 on the
facing page the shell prompt tells you that you are connected as user Freddy
to the computer nukleus and you are currently in the directory Freddy (that
is Freddy’s home directory). The messages you see at login differ from system
to system; but this general outline should work out in most cases.
I guess it is clear that you have to press the Enter or Return key after you
have entered your name and password, respectively. Furthermore, make sure to
type your username at the login prompt and your password at the password
prompt without errors. Backspacing to delete previous characters may not
work (though Linux is more forgiving than Unix). If you make a mistake,
repeatedly use the Enter key to get a new login prompt and try again. Also
make sure to use the exact combination of upper- and lowercase letters. Linux
is case-sensitive!
When you login to your account it might look different. However, as a general
rule, you get three messages after a command line login: a) motd – This is the
message of the day. Here the system administrator might inform you about
system-down times or just present a message to cheer you up. b) last login –
As mentioned above, this is the date of your last login. c) You have new mail
– This message does exactly what it says. It tells you that at least one new
email is waiting for you in your email program.
Once you have logged in to the computer you can start working. Let us start
with some simple commands in order to get accustomed to the Unix/Linux
environment. Commands are entered in the same way as you just entered your
username: enter the command date and press the Enter key.
Terminal 3: Date
1 $ date
2 Tue Jan 23 17:23:43 CET 2003
3 $
The character “>” redirects the output of the command date into a file
which is named file-with-date. If it does not exist, this file is created auto-
matically. Otherwise it will be overwritten. You could use “>>” to append the
output to an existing file. Of course, since the output is redirected, you do not
see the result. However, as shown in Terminal 5, you can use the command
cat to display the content of a file.
Terminal 5: Show Date-File
1 $ cat file-with-date
2 Tue Jan 23 17:25:44 CET 2003
3 $
In general, commands can be simple one-word entries such as the date com-
mand. They can also be more complex. Most commands accept arguments.
An argument can be either an option or a filename. The general format for
commands is: command option(s) filename(s). Commands are entered in
lowercase. Options often are single characters prefixed with a dash (see Termi-
nal 6 on the facing page). Multiple options can be written individually (-a -b)
or, sometimes, combined (-ab) – you have to try it out. Some commands have
options made from complete words or phrases. They start with two dashes,
like --confirm-delete. Options are given before filenames. Command, op-
tion and filename must be separated by spaces. In a few cases, an option has
another argument associated with it. The sort command is one example (see
Sect. 6.1.1 on page 65). This command sorts the lines of one or more text
files alphabetically. You can tell sort to write the sorted text to
a file whose name is given after the option -o (output).
Terminal 7: Options
1 $ date > file-with-date
2 $ cat file-with-date
3 Tue Jan 23 17:25:44 CET 2003
4 $ date >> file-with-date
5 $ cat file-with-date
6 Tue Jan 23 17:25:44 CET 2003
7 Tue Jan 23 17:25:59 CET 2003
8 $ sort file-with-date
9 Tue Jan 23 17:25:44 CET 2003
10 Tue Jan 23 17:25:59 CET 2003
11 $ sort -o sorted-file file-with-date
12 $ cat sorted-file
13 Tue Jan 23 17:25:44 CET 2003
14 Tue Jan 23 17:25:59 CET 2003
15 $
Together with your username your password clearly identifies you. If anyone
knows both your username and password, he or she can do everything you
can do with your account. You do not want that! Therefore you must keep
your password secret. When you logged in for the first time to Unix/Linux
you probably got a password from the system administrator; but probably
you want to change it to a password you can remember. To do so you use the
command passwd as shown in Terminal 8.
Terminal 8: passwd
1 $ passwd
2 Changing password for user Freddy.
3 Changing password for Freddy
4 (current) UNIX password:
5 New password:
6 Retype new password:
7 passwd: all authentication tokens updated successfully.
8 $
First you have to enter your current password. Then, in line 5 in Terminal 8
you are requested to enter your new password. If you choose a very simple
password, the system might reject it and ask for a more complicated one. Be
aware that passwords are case-sensitive and should not contain any blanks
(spaces)! You should also avoid using special characters. If you log in from
a remote terminal the keyboard might be different. To avoid searching for
the right characters, just omit them right from the beginning. After you have
retyped the password without mistake, the system tells you that your new
password is active. Do not forget it!
Imagine you remember a command name but you cannot recall its function or
syntax. What can you do? Well, of course, you always have the option of trial
and error. However, that might be dangerous and could corrupt your data.
There are much more convenient ways.
First, many commands provide the option -h or --help. Guess what: h stands
for help. Actually, --help works more often than -h and might be easier to
remember, too. The command ls -h, for example, lists file sizes in a human
readable way – you must use ls --help instead in order to obtain help. By
applying the help option, many commands give a short description of how to
use them.
There is, however, a much more informative way than using the help op-
tion. Almost all Unix systems, including Linux, have a documentation sys-
tem derived from a manual originally called the Unix Programmer’s Manual.
This manual has numbered sections. Each section is a collection of manual
pages, also called manpages, and each program has its own manpage. Most
Unix/Linux installations have individual manpages stored on the computer.
Thus, they can be accessed at any time. To do so, use the command man (man-
ual). For example, if you want to find information about the command sort,
then you type “man sort”. Usually the output is directly sent to a page viewer
(pager) like less (see Sect. 6.1.3 on page 67). If not, you can pipe it to less
by typing “man sort|less”. You can then scroll through the text using the
↑, ↓, PgUp and PgDn keys. You leave the manpage by pressing Q. You
should note that manpages are not available for all entries. Especially com-
mands like cd, which are not separate Linux programs but part of the shell,
are not individually documented. However, you can find them in the shell’s
manpages with man bash.
The info command (information) serves a similar purpose. How-
ever, the output format on the screen is different. To read through the output,
press the space bar; to quit the documentation, press Q. It is not pos-
sible to scroll back through the output.
If you have only a faint idea of a command, use apropos keyword. This command
searches for the keyword in an index database of the manpages. The keyword
may also contain wildcards (see Sect. 7.10 on page 91).
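To summarize the three ways of getting help described above, with sort as an
arbitrary example:

$ sort --help      # short usage summary
$ man sort         # complete manual page, shown in a pager
$ apropos sort     # search the manpage index for a keyword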
3.3 Logout
There are several ways to end a session. The two most common ones are the
command logout and the key combination Ctrl+D. You are done. If you work
on your home computer you will see a
new “login:” prompt. You should login again as superuser (root) and halt the
system with the command halt. A lot of information will flush over the screen
until it says something like: “Run Level 0 has been reached”. Then you can
switch off the power if the computer does not do so automatically.
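On a private machine, the end of a session might thus look like the following
sketch (the # prompt indicates that you are now working as root):

$ logout
login: root
Password:
# halt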
Exercises
There is no other way to learn Linux than by working with it. Therefore you
must exercise and play around.
3.1. Login to your Unix-based computer. What information do you get from
your computer?
3.2. Write a command and use the erase key to delete characters. The erase
key differs from system to system and account to account. Try out BkSp, Del
and Ctrl+H.
3.3. Use the command date and redirect the output into a file named the_date.
Append the current date again to that file and print the resulting file content
onto the screen.
3.4. Change your password. After you have changed it successfully, you might
want to restore your old one. Do so!
4
Working with Files
Before people started working with computers they used to work at desks
(nowadays, people work with a computer on their lap on the sofa.) They had
information in files which were organized in different folders. Well, now we
work on virtual desktops and still use files and folders. Figure 4.1 gives an
example of how files are organized in directories.
Fig. 4.1. This sketch illustrates the organization of directories (folders) and files
in Unix/Linux. The root directory is shown at the top. Directories are underlaid in
grey. As an example, the path to the file gccontent.pl is shown
Files are the smallest unit of information the normal user will encounter.
A file might contain everything from nothing to programs to file archives. In
order to keep things sorted, files are organized in folders (directories). You
can name files and directories as you wish – but you must avoid certain
special characters, most importantly the slash “/”, which is reserved as the
directory separator. You should also avoid using the space character “ ”. In
order to separate words in filenames, the underscore character “_” is commonly
used. Unix/Linux
discriminate between upper- and lowercase characters. A file named file can
be distinguished from a file called File. This is a very important difference to
Microsoft Windows.
In fact, in Unix/Linux everything is a file. Even devices like printers or the
screen are treated as if they were files. For example, on many systems the file
tty1 stands for the first virtual console. Table 4.1 gives you a short description
of the most common system directories and their content.
Table 4.1. The most common system directories and their content
Directory Content
/bin essential system programs
/boot kernel and boot files
/dev devices like printers and USB or serial ports
/etc configuration files
/home user’s home directories
/lib shared libraries needed by the system programs
/lost+found loose file fragments from system checks
/mnt mount points for external drives like floppy disks
/proc system information
/root root’s home directory
/sbin administrative programs
/tmp temporary files
/usr static files like programs
/var variable files like log files
The absolute path is unique for each file and directory and contains all parent
directories up to the root directory “/”. Thus, line 2 of Terminal 9 on the facing
page reads: the directory Freddy resides in the directory home, which in turn
resides in the root directory (/). The name of the home directory always equals
the username, here Freddy.
If you want to see what is in your home directory, type the command ls (list).
Terminal 10: ls
1 $ mkdir Awk_Programs
2 $ mkdir Text
3 $ mkdir .hidden
4 $ date > the.date
5 $ date > Text/the.date
6 $ ls
7 Awk_Programs Text the.date
8 $ ls -a
9 . .. Awk_Programs .hidden Text the.date
10 $ ls -l
11 total 12
12 drwxrwxr-x 2 rw rw 4096 Apr 1 16:50 Awk_Programs
13 drwxrwxr-x 2 rw rw 4096 Apr 1 16:51 Text
14 -rw-rw-r-- 1 rw rw 30 Apr 1 17:16 the.date
15 $ ls -lh
16 total 12K
17 drwxrwxr-x 2 rw rw 4.0K Apr 1 16:50 Awk_Programs
18 drwxrwxr-x 2 rw rw 4.0K Apr 1 16:51 Text
19 -rw-rw-r-- 1 rw rw 30 Apr 1 17:16 the.date
20 $ ls Text
21 the.date
22 $
Thus you can directly see what are files and what are directories. You may
have noticed the directories “.” and “..”. These are special directories. The
directory “.” stands for the working directory. This is useful in commands
like cp (copy). The “..” directory is the relative pathname to the parent
directory. This relative pathname is helpful when you change to the parent
directory with the command cd (change directory). The function of “.” and
“..” is shown in Terminal 11 on page 40. Now we want to get some more
information about the files and directories. We use the option -l (list). With
this option the command ls lists the files and folders and gives additional
information. First, in line 11, the size of the directory is shown. Here 12
Kilobytes are occupied. In line 15 we add the option -h (human). We obtain
the same output, but human-readable. Now we see that the number 12 means
12K, which reads 12 Kilobytes. In line 20, we list the content of the directory
Text. It contains the file the.date, which we saved at that location in line 5.
With the option -R (recursively) you can list the content of subdirectories
immediately (see Terminal 13 on page 45).
Fig. 4.2. Data obtained by the command ls -l. The numbers in italics correspond
to the description in the text
The following list describes the individual file attributes in more detail.
The numbers correspond to the italic numbers in Fig. 4.2.
1. Type
The first character indicates whether the item is a directory (d), a normal
file (-), a block-oriented device (b), a character-oriented device (c) or a
link (l). Actually, a directory is a special type of file. The same holds
for the special directories “.” and “..” which can be seen in line 9 in
Terminal 10 on the preceding page.
2. Access Modes
The access modes or permissions specify three types of users who are
either allowed to read (r), write (w) or execute (x) the file. The first block
refers to the file’s owner, the second to the owner’s group and the third to all
other users.
Fig. 4.3. The two special files dot and dotdot can be found in every directory.
They are frequently used together with commands like cd (change directory) and
cp (copy)
Let us first check where we are. We use the command pwd (print working
directory). Then we create the directory temp and the file the.date in the
current directory /home/Freddy. Next, in line 5, we change into the directory
temp with the command cd (change directory). By using pwd we can prove
that we changed into the directory temp which we created in line 3. With the
ls command in line 8 we show that the directory is empty. Now we copy the
file the.date which we created in directory /home/Freddy into the currently
active directory /home/Freddy/temp (where we are right now). To do this, we
apply the command cp (copy). The parent directory to the currently active
directory has the shortcut “..”, whereas the current active directory has the
shortcut “.”. The syntax of the copy command is “cp source destination”.
Thus, with ../the.date we call the file the.date which is one directory up. With
the single dot we specify the directory we are in. This is the destination. Thus
the whole command in line 9 copies the file the.date from /home/Freddy to
/home/Freddy/temp. Note that you have duplicated the file. In order to move
the file, you have to replace the command cp with the command mv (move).
In line 12 we jump back to our home directory. Again we make use of the “..”
shortcut. Then, in line 15, we remove the directory temp and all its content.
To do this, we apply the command rm (remove) with the option -r (recursively).
rm alone suffices if we want to erase one or more files. However, to erase
directories, even empty ones, you will need the option -r. Why is that? Well,
we just saw that even an empty directory contains two, though hidden, files:
“.” and “..”.
4.4.1 Directories
Access permissions to directories help to control access to the files and subdi-
rectories in that directory. If a directory has read permission (r), a user can
run the ls command in order to see the directory content. He can also use
wildcards (see Sect. 7.10 on page 91) to match files in this directory. If a direc-
tory has write permission (w), users can add, rename and delete files in that
directory and all subdirectories. To access a directory, that is to read, write
or execute files or programs, a user needs execute permission (x) on that di-
rectory. Note: To access a directory, a user must also have execute permission
to all of its parent directories, all the way up to the root directory “/”.
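The following little session illustrates these rules (a sketch; the directory
name secret and the file name hidden.txt are made up):

$ mkdir secret
$ date > secret/hidden.txt
$ chmod u=wx secret        # remove the read permission, keep write and execute
$ ls secret                # listing requires read permission - this fails
ls: secret: Permission denied
$ cat secret/hidden.txt    # access via the known name still works (x is set)
$ chmod u=rwx secret       # restore the usual permissions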
4.4.2 Files
Access permissions on a file always refer to the file’s content. As we saw in the
last paragraph, the access permissions of the directory where the file is located
control whether the file can be renamed or removed. Read permissions (r) and
write permissions (w) control whether you can read or change the file’s content,
respectively. If the file is actually a program it must have execute permission
(x) in order to allow a user to run the program.
4.4.3 Examples
Let us take a look at some examples in order to clarify the situation. Look
back at Terminal 10 on page 37 line 14. The first package of information,
-rw-rw-r--, encodes the file type and the access permissions. The following
list shows some typical settings:
- rwx rwx rwx A file which can be read and modified by everybody.
This setting gives no security at all.
- rw- r-- r-- This represents the standard setting for a file. The file
can be read and modified by the owner, but only be read
by all the others.
- r-- --- --- This is the most secure setting. The owner can read but
not modify the file. All the others can neither read nor
modify the file.
- rwx r-x r-x The standard setting for executable files. Everybody is
allowed to run the program but only the user can change
the file.
d rwx r-x r-x The standard setting for directories. The user is allowed
to access the directory and write and delete entries. All
the other users can access the directory and list its con-
tent. However, they are not allowed to write or delete
entries.
d rwx --x --x Here the user has the same rights as in the example
above. Other users are allowed to access the directory
(with the command cd); however, they cannot list the
content with the command ls. If the name of entries
is known to the other users, they can, depending on
the access rights of these entries, be read, written and
executed.
Now, in Terminal 12, we will play with file permissions. Less common are the
commands “chown user file ” and “chgrp group file ”. They are used to
change the user and the group ownership of a file or directory, respectively.
The former commands can be executed only by the superuser root. We will
concentrate on the command “chmod mode file ”.
Table 4.2. Parameters that can be used with the command chmod
User        Type         Rights
u – user    + – add      r – read
g – group   - – delete   w – write
o – others               x – execute
a – all
In the numeric notation, the read, write and execute permissions correspond
to the values 4, 2 and 1, respectively, and the values are added up for each
user type: if only read and write permissions are to be set, the numbers add
up to 6, and so forth.
In line 12 of Terminal 12 on the page before we apply the code. With the
command chmod 664 thedate we change the access permissions back to the
default values: the user and users of the same group are allowed to read (r)
and modify (w) the file thedate, whereas all the other users are only allowed
to read the file.
If you want to change the permissions for many files at once you just list them
one after the other: chmod 777 file1 file2 file3. If you use the wildcard
”*” instead of file names, you change permissions for all files in the directory at
once. If you use the number 0, you reset all permissions. For example, “chmod
660 *” sets the permissions of all files in the current directory to reading and
writing for you and your group only.
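The symbolic and the numeric notation are interchangeable. As a sketch, the
following two commands set exactly the same permissions (assuming a file
named thedate exists):

$ chmod 664 thedate              # numeric: rw- for user and group, r-- for others
$ chmod u=rw,g=rw,o=r thedate    # symbolic: the same setting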
Archiving and compressing files are two separate tasks with Linux – at least
at the command line level. With Linux you first create an archive of files or
directories and then compress the archive file.
The most important command to create or extract an archive is tar (tape
archive). The name already implies what the main purpose of the command
is: to create a file archive for storage on a tape. Even nowadays, large data sets
are stored on magnetic tapes. The reason is the high density of information
that can be stored on magnetic tapes (maybe you also remember the Datasette
from the Commodore 64 times). Well, let us come back to the tar command.
With tar you can put many directories and files into one single file. Thus,
it is very easy to recover the original file organization. For that reason pro-
grams and other data are often distributed in the archived form. In your local
working directory you then extract the archive and get back the original file
and directory structure. As with many programs, in addition to the original
tar program a GNU version called gtar is available on many systems. With
gtar it is possible to compress files before they are put into the archive. By
this, around 50% of disk space can be saved (and, if you think further, money!).
On most systems gtar completely substitutes for tar, although the command
name tar has been kept. Thus, you do not have to bother about gtar. The
output of tar can be sent either to a file or to a device like the floppy disk, a
CD-ROM burner or a streamer.
Now let us use tar. First go to your home directory (cd) and create a di-
rectory called tartest (mkdir). Change into the newly created directory (cd)
and create the file tartest_file with date > tartest_file. Then create the
directories tata1 and tata2. Within each of these two directories create 3 files
named tatax_filey with commands like man tar > tata1/tata1_file1 (x is the num-
ber of the directory and y the file number). By redirecting (>) the output of
the man command we create files with some content. You should end up with
a list of files shown in Terminal 13.
Terminal 13: tar
1 $ ls -R
2 .:
3 tartest_file tata1 tata2
4
5 ./tata1:
6 tata1_file1 tata1_file2 tata1_file3
7
8 ./tata2:
9 tata2_file1 tata2_file2 tata2_file3
10 $
In the directory ./tata1 (line 5) we find the files tata1_file1 to tata1_file3
(line 6) and so on.
Before we start using the tar command we should look at its most important
options: c (create a new archive), x (extract files from an archive), f (read
from or write to the archive file given as argument), v (verbose output),
z (compress or uncompress the archive with gzip) and t (list the archive
content).
Of course, there are many more options. Take a look at the manpages
(see Sect. 3.2.4 on page 33). As a big exception, the tar command takes the
options without any dashes (-). Note that you should give your archives an
appropriate filename extension: .tar for uncompressed and .tgz for compressed
archives. Otherwise, you will not recognize the file as an archive! Now let us
take a look at Terminal 14.
Terminal 14: tar
1 $ pwd
2 /home/Freddy/tartest
3 $ tar cvf ../daten.tar .
4 ./
5 ./tata1/
6 ./tata1/tata1_file1
7 ./tata1/tata1_file2
8 ./tata1/tata1_file3
9 ./tata2/
10 ./tata2/tata2_file1
11 ./tata2/tata2_file2
12 ./tata2/tata2_file3
13 ./tartest_file
14 $ tar cfz ../daten.tgz .
15 $ ls -lh ../daten*
16 -rw-rw-r-- 1 Freddy Freddy 100K Apr 19 15:55 ../daten.tar
17 -rw-rw-r-- 1 Freddy Freddy 5.7K Apr 19 15:55 ../daten.tgz
18 $
In line 3 of Terminal 14 we create the archive daten.tar in the parent
directory (..). The archive should contain everything which is in the
current working directory (.). We choose to follow the progress of the pro-
gram (v). In line 14 we create in the parent directory the archive daten.tgz,
which is compressed (z). Furthermore, we do not wish to follow the program’s
progress (no verbose). In line 15 we use the command ls to list all files in the
parent directory that begin with daten. Furthermore, we instruct ls to give
file details (-l) and print the file size human-readable (-h). As you can see,
the compressed archive is much, much smaller than the uncompressed version.
Now take a look at the archives daten.tar and daten.tgz by typing cat
../daten.tar |less and cat ../daten.tgz |zless, respectively. You can
scroll through the file content using ↑, ↓, PgUp and PgDn. In order to get
back to the command line press Q.
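So far we have only created archives. To unpack them again, you replace the
option c (create) with x (extract); with t you can list the archive content
without extracting anything. A sketch, assuming the archives daten.tar and
daten.tgz from Terminal 14 and a made-up scratch directory named tarcopy:

$ mkdir tarcopy
$ cd tarcopy
$ tar tf ~/daten.tar       # list the content of the archive
$ tar xvf ~/daten.tar      # extract the uncompressed archive
$ tar xvzf ~/daten.tgz     # extract the compressed archive (note the z)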
zip/unzip
The most commonly used data-compression format is the ZIP format. ZIP
files use a number of techniques to reduce the size of a file, including file
shrinking, file reduction and file implosion. The ZIP format was developed
by Phil Katz for his DOS-based program PkZIP in 1986 and is now widely
used in Windows-based programs such as WinZip. The file extension given to
ZIP files is .zip. The content of a zipped file can be listed with the command
“unzip -l filename.zip ”.
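If the command zip is installed on your system, you can also create ZIP
archives yourself. A minimal sketch (the archive name myarchive.zip is
made up):

$ zip -r myarchive.zip tartest     # recursively pack the directory tartest
$ unzip -l myarchive.zip           # list the archive content
$ unzip myarchive.zip              # extract the files again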
gzip/gunzip
The GNU compression tool gzip is found on most Linux systems. Files are
compressed by “gzip filename ” and automatically get the extension .gz.
The original file is replaced by the compressed file. After uncompressing using
“gzip -d filename ” or “gunzip filename ” the compressed file is replaced
by the uncompressed file. In order to avoid replacement you can use the option
-c. This redirects the output to standard output and can thus be redirected
into the file. The command “gzip -c filename > filename2 ” creates the
compressed file filename2 and leaves the file filename untouched. With the
option -r you can recursively compress the content of subdirectories.
gzip is used when you compress an archive with tar z (see Sect. 4.5 on
page 44). Then you get the file extension .tgz.
bzip2/bunzip2
The command bzip2 is quite new and works more efficiently than gzip. Files
are usually 20 to 30% smaller compared to gzip compressed files.
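You use bzip2 just like gzip. A quick comparison might look like this (a
sketch; the file name tarman.txt is made up and the resulting sizes depend
on your system):

$ man tar > tarman.txt
$ gzip -c tarman.txt > tarman.txt.gz     # gzip copy, original left untouched
$ bzip2 -c tarman.txt > tarman.txt.bz2   # bzip2 copy, original left untouched
$ ls -l tarman.txt*                      # compare the file sizes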
compress/uncompress
The compression program compress filename is quite inefficient and not
widespread any more. Compressed files carry the extension .Z. Files can be
uncompressed using compress -d filename (decompress) or uncompress
filename .
Table 4.4 gives an overview of file extensions and their association with com-
pression commands.
Again, there are many more options available. You are always welcome to
learn more about the commands by looking into the manpages, in this case
by applying man find (see Sect. 3.2.4 on page 33). Let us perform a little
exercise in order to get a better feeling for the find command. Go to your
home directory (cd) and create a directory called find-test with the subdi-
rectory sub. Now create two files within find-test (date>find-test/1 and
date>find-test/2) and one file within find-test/sub (date>find-test/sub/3.txt).
Terminal 15: find
1 $ mkdir find-test
2 $ mkdir find-test/sub
3 $ date>find-test/1
4 $ date>find-test/2
5 $ date>find-test/sub/3.txt
6 $ find find-test -type f -size -6 -name "[0-9]"
7 find-test/1
8 find-test/2
9 $ find find-test -type f -size -6 -name "[0-9]*"
10 find-test/1
11 find-test/2
12 find-test/sub/3.txt
13 $ find find-test -type f -size -6c -name "[0-9]*"
14 $ find find-test -type f -size -6 -name "[0-9]*"
15 -name "*.txt"
16 find-test/sub/3.txt
17 $ mkdir find-found
18 $ find find-test -type f -size -6 -name "[0-9]*"
19 -name "*.txt" -exec cp {} find-found \;
20 $ ls find-found/
21 3.txt
22 $
Exercises
The following exercises should strengthen your file-handling power.
4.1. Go to your home directory and check that you are in the right directory.
List all files in your home directory. List all the files in that directory including
the hidden files. List all the files including the hidden files in the long format.
4.2. Determine where you presently are in the directory tree. Move up one
level. Move to the root directory. Go back to your home directory.
4.3. Create a directory called testdir in your home directory, and a directory
named subdir in the testdir directory (testdir/subdir ). List all the files in the
testdir/subdir subdirectory, including the hidden files. What are the entries
you see? Remove the subdir subdirectory in your home directory.
4.4. Using the cd and the ls commands, go through all directories on your
computer and draw a directory tree. Note: You may not have permission to
read all directories. Which are these?
4.5. What permissions do you have on the /etc directory and the /etc/passwd
file? List all the files and directories in your /home directory. Who has which
rights on these files, and what does that mean? Print the content of the file
/etc/passwd onto the screen.
4.6. In your home directory create a directory named testdir2. In that direc-
tory create a file containing the current date. Now copy the whole directory
with all its content into the folder testdir. Remove the old testdir2. Check that
you have successfully completed your mission.
4.7. Create the file now with the current date in the directory testdir. Change
permissions of this file: give yourself read and write permission, people in your
group should have read, but no write permission, and people not in your group
should not have any permissions on the file.
4.8. What permissions do the codes 750, 600 and 640 stand for? What does
chmod u=rw,go-rwx myfile do? Change permissions of all files in the testdir
directory and all its subdirectories. You should have read and write permis-
sion, people in your group have only reading permission, and other people
should have no permissions at all.
4.9. Create a new directory and move into it with a one-liner.
4.10. Use different compression tools to compress the manual page of the
sort command. Calculate the compression efficiencies in percent.
4.11. Create the directory testpress in your home directory. Within testpress
create 3 filled files (redirect manpages into the files). From your home direc-
tory use the commands (un)compress, (un)zip, g(un)zip and b(un)zip2
to compress and uncompress the whole directory content (use the option -r
(recursively)), respectively. What does the compressed directory look like?
4.12. Create a hidden file. What do you have to do in order to see it when
you list the directories content?
4.13. Create a number of directories with subdirectories and files and create
file archives. Extract the file archives again. Check whether files are overwrit-
ten or not.
4.14. Play around with copying, moving and renaming files. What is the dif-
ference between moving and renaming files? What happens to the file’s time
stamp when you copy, move or rename a file?
4.15. Create a directory in a directory, a so-called subdirectory. What happens
to the number of links?
4.16. Create a file in a directory. Change this file. What happens to the mod-
ification date of the directory? Add a new file. What happens now? Rename
a file with the command mv (move). What happens now?
4.17. Play around with the find command. Create directories, subdirectories
and files and try out different attributes to find files.
5
Installing BLAST and ClustalW
In this chapter you will learn how to install small programs. As examples, we
are using BLAST [1] (Basic Local Alignment Search Tool) and ClustalW [14].
BLAST is a powerful tool to find sequences in a database. Assume you have
sequenced a gene and now want to check whether there are already similar
genes sequenced by somebody else. Then you “blast” your sequence against
an online database and get similar sequences, if present, as an output. Now
assume you have found 10 similar genes. Of course, you would like to find
regions of high similarity, that is, regions where these genes are conserved.
For this, one uses ClustalW. ClustalW is a general-purpose multiple-sequence
alignment program for DNA or protein sequences. It produces biologically
meaningful multiple-sequence alignments of divergent sequences. ClustalW
calculates the best match for the selected sequences and lines them up such
that the identities, similarities and differences can be seen. Then evolutionary
relationships can be visualized by generating cladograms or phylograms.
Both programs use different installation procedures. BLAST comes as a
packed and compressed archive that needs only to be unpacked. ClustalW
comes as a packed and compressed archive, too. However, before you can run
the program it needs to be compiled. This means you get the source code of
the program and must create the executable files from it.
The program ftp (file transfer protocol) allows a user to transfer files to
and from a remote network site via the Internet. In fact, it is a very
powerful program with many options. We
will use only a small fraction of its capabilities. If you wish, take a look at the
manual pages (man ftp).
We will not download the newest version of BLAST, which is at the time
of writing these lines version 2.2.8, but version 2.2.4. This version has fewer
dependencies than the newer ones and should run on all typical Unix/Linux
installations. Now, let us start. Go into your home directory by typing cd
(remember: using the command cd without any directory name will auto-
matically bring you to your home directory). Make a directory for BLAST
by typing mkdir blast. Type cd blast to change into the blast direc-
tory. Then type “ftp ftp.ncbi.nih.gov”. With this, you tell Linux to con-
nect to the file server named ftp.ncbi.nih.gov using the file transfer proto-
col (ftp). Enter anonymous as the username. Enter your e-mail address as
the password. Type bin to set the binary transfer mode. Next, type cd
blast/executables/release/2.2.4/ to get to the right directory on the
remote file server and then type ls to see a list of the available files. You
should recognize a file named blast-2.2.4-ia32-linux.tar.gz. Since we are work-
ing on a Linux system, this is the file we will download. If, instead, you work
on a Unix workstation or with Mac OS X Darwin, you should download blast-
2.2.4-ia32-solaris.tar.gz or blast-2.2.4-powerpc-macosx.tar.gz, respectively. To
download the required file type the command get followed by the name of the
file for your system, e.g. get blast-2.2.4-ia32-linux.tar.gz. Finally, type
quit to close the connection and stop the ftp program. The file blast-2.2.4-
ia32-linux.tar.gz should now be in your working directory. Check it using the
command ls.
Terminal 16: Downloading BLAST
1 $ ftp ftp.ncbi.nih.gov
2 Connected to ftp.ncbi.nih.gov.
3 Public data may be downloaded by logging in as
4 "anonymous" using your E-mail address as a password.
5 220 FTP Server ready.
6 Name (ftp.ncbi.nih.gov:rw): anonymous
7 331 Anonymous login ok,
8 send your complete email address as your password.
9 Password:
10 230 Anonymous access granted, restrictions apply.
11 Remote system type is UNIX.
12 Using binary mode to transfer files.
13 ftp> bin
14 200 Type set to I
15 ftp> cd blast/executables/release/2.2.4/
16 250 CWD command successful.
17 ftp> ls
18 500 EPSV not understood
19 227 Entering Passive Mode (130,14,29,30,197,39).
20 150 Opening ASCII mode data connection for file list
21 -r--r--r-- 1 ftp anonymous 15646697 Aug 29 2002
22 blast-2.2.4-ia32-linux.tar.gz
23 ...
24 -r--r--r-- 1 ftp anonymous 41198447 Aug 29 2002
25 blast-2.2.4-powerpc-macosx.tar.gz
26 226 Transfer complete.
27 ftp> get blast-2.2.4-ia32-linux.tar.gz
28 local: blast-2.2.4-ia32-linux.tar.gz
29 remote: blast-2.2.4-ia32-linux.tar.gz
30 227 Entering Passive Mode (130,14,29,30,196,216).
31 150 Opening BINARY mode data connection for
32 blast-2.2.4-ia32-linux.tar.gz (15646697 bytes)
33 226 Transfer complete.
34 15646697 bytes received in 02:53 (87.96 KB/s)
35 ftp> quit
36 221 Goodbye.
37 $
ClustalW is downloaded in much the same way, this time from the FTP
server of the European Bioinformatics Institute (ftp.ebi.ac.uk).
Terminal 17: Downloading ClustalW
1 $ ftp ftp.ebi.ac.uk
2 Connected to ftp.ebi.ac.uk.
3 ...
4 Name (ftp.ebi.ac.uk:rw): anonymous
5 331 Guest login ok,
6 send your complete e-mail address as password.
7 Password:
8 230-Welcome anonymous@134.95.189.5
9 ...
10 230 Guest login ok, access restrictions apply.
11 Remote system type is UNIX.
12 Using binary mode to transfer files.
13 ftp> cd pub/software/unix/clustalw
14 ...
15 ftp> get clustalw1.83.UNIX.tar.gz
16 local: clustalw1.83.UNIX.tar.gz
17 remote: clustalw1.83.UNIX.tar.gz
18 500 ’EPSV’: command not understood.
19 227 Entering Passive Mode (193,62,196,103,227,137)
20 150 Opening BINARY mode data connection for
21 clustalw1.83.UNIX.tar.gz (166863 bytes).
22 100% |****************| 162 KB 151.41 KB/s 00:00 ETA
23 226 Transfer complete.
24 166863 bytes received in 00:01 (148.60 KB/s)
25 ftp> pwd
26 257 "/pub/software/unix/clustalw" is current directory.
27 ftp> !pwd
28 /home/emboss/Freddy/clustal
29 ftp> !ls
30 clustalw1.83.UNIX.tar.gz
31 ftp> quit
32 221-You have transferred 166863 bytes in 1 files.
33 ...
34 221 Goodbye.
35 $
Be sure you are in the directory blast and that you correctly downloaded
the file blast-2.2.4-ia32-linux.tar.gz into it as described in Section 5.1.1 on
page 54. Now, type
gunzip blast-2.2.4-ia32-linux.tar.gz
Next, type
tar -xf blast-2.2.4-ia32-linux.tar
This will extract the archive into the current working directory (see Sect. 4.5 on
page 44). You are done! BLAST is distributed with ready-to-go executable
files. You can run the program directly. To run, for example, the program
blastall you just type in the current directory ./blastall. Why do you have
to precede the command with “./”? Well, the program is not registered in the
path of the system. The path is a variable that contains all directories where
the system searches for executable programs (see Sect. 8.2 on page 99). By
typing ./ you explicitly define the location of the command. From the parent
directory of blast you would have to call the program with blast/blastall.
Of course, blastall does not do too many interesting things at this stage
since we have not provided any data for it. Instead, the program displays all
its options. However, it demonstrates that the program is alive.
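As a side note: with GNU tar, which is the standard on Linux, uncompressing
and unpacking can be combined into one step. The single command

tar -xzf blast-2.2.4-ia32-linux.tar.gz

should be equivalent to running gunzip followed by tar -xf as described
above.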
Now the interesting part starts! We query the database for our query sequence,
which is in the file called testblast. To do this, we use a command of the form
./blastall -p blastn -d testdb -i testblast
The first option (-p blastn) defines the query program. With blastn we
choose the normal program to query nucleotide sequences. Furthermore, we
specify the name of the database (testdb) and the name of the file containing
the query sequence (testblast ). (You could also enter several sequences into
the query file. They would be processed one after the other. It is, however,
important that you use the FASTA format for the file.) The output starts
in line 13. Note that, for printing reasons, the output in Terminal 18 on the
preceding page is shorter than in reality. The sequence was actually found
three times in sequence 1.
gunzip clustalw1.83.UNIX.tar.gz
followed by
tar -xf clustalw1.83.UNIX.tar
A new directory named clustalw1.83 will be created. Change into this di-
rectory (cd clustalw1.83). Now we start the compilation process: just type
make. You will see a number of lines of the type cc -c -O interface.c popping
up. Finally you get back to your shell prompt. When you now list (ls) the
directory’s content, you will recognize a number of new files with the file ex-
tension .o and one new executable file called clustalw. That is the program
compiled for your system! How do you recognize an executable file? Take a
look back into the section on file attributes (see Sect. 4.2 on page 38).
13 gene2 GGGCCCTTAGCTCAGCTGGGAGAGA-
14 gene3 GGGGGCGTAGCTCAGCTGGGAGAGA-
15 gene1 GGGCTCATAGCTCAGC-GGTAGAGTG
16 *** * ********* ** ****
17 $
In Terminal 19 we first create a file called tRNA, which contains the se-
quences we want to align. To do this, we apply the command cat as explained
in Section 6.1 on page 64. In line 8 of Terminal 19 we actually start the
program ClustalW with
./clustalw tRNA
We get a lot of output on the screen, which is omitted here. The alignment
is not printed out onto the screen but written into the file tRNA.aln. We
can take a look at the alignment using cat tRNA.aln (see line 10). There
are many more options for ClustalW available. Moreover, in addition to the
alignment, a phylogenetic tree is created; but this is not our focus here.
Exercises
In these exercises you will download and compile the program tacg [8]. In or-
der to do so, you need to be connected to the Internet. If you succeed, you
can already consider yourself a little Linux freak. tacg is a command-line
program that performs many of the common routines in pattern matching in
biological strings. It was originally designed for restriction enzyme analysis
and while this still forms a core of the program, it has been expanded to fulfil
more functions. tacg searches a DNA sequence read from the command line
or a file for matches based on descriptions stored in a database of patterns.
These descriptions can, e.g., be formatted as explicit sequences, matrix de-
scriptions or regular expressions describing restriction enzyme cutting sites,
transcription factor binding sites or whatever. The query result is sent to
the standard output (the screen) or a file. With tacg you can also translate
DNA sequences to protein sequences or search for open reading frames in any
reading frame.
5.1. Connect to the FTP server ftp.sunet.se, login anonymously and change to
the following directory: pub/molbio/restrict-enz/. List the directory content
and download the file tacg-3.50-src.tar.gz into your home directory. Log off
and move the file into a directory called dna grep.
5.3. In this exercise you perform the compilation step. Follow the instruc-
tions given in the file dna grep/tacg-3.50-src/INSTALL. Check if you were
successful by typing ./tacg.
5.4. If you wish, you can take a look at the documentation in the subfolder
Docs and play around a little.
6
Working with Text
After learning some basics in the previous chapter, you will learn more ad-
vanced tools from now on. Since you are going to learn programming, you
should be able to enter a program, that is, a text. Certainly your programs
will not run immediately. This means you need to edit text files, too. In this
chapter you will use very basic text-editing tools. We will concentrate on the
text editor vi.
With Linux you have the choice of an endless list of text editors. From
Microsoft Windows you might have heard about the Notepad. That is the
standard Windows text editor. With Linux you can choose between ed, vi,
vim, elvis, emacs, nedit, kedit, gedit and many more. In general, we can
distinguish between three different types of text editors: a) One group of text
editors is line-oriented. ed belongs to this group. You can only work on one
line of the text file and then need a command to get to the next line. This
means that you cannot use the arrow keys to move the cursor (your current
position) through the text. This is stone age text editing and good only for
learning purposes. b) The next group of text editors is screen-oriented. vi be-
longs to this group of editors. You see the text over the whole screen and can
scroll up and down and move to every text position using the arrow keys. This
is much more comfortable than using a line-oriented editor. We will predom-
inantly work with vi. It is powerful and comfortable enough for our purpose
and usually available on all systems. You can even use it when you have no
X-Server (see Sect. 2.3 on page 14) running, which is often the case when you
login to a remote computer. c) The most comfortable text editors, of course,
use the X-Server. However, you should not confuse comfortable with powerful!
nedit or kedit are examples of editors which use the graphical user interface
(GUI). Here you can use the mouse to jump around in the text and copy and
paste (drag and drop) marked text. We are not going to work with these nice
editors.
Now, let us start with a very easy method to input text. For this, we use
the command cat (concatenate). We have already used cat for displaying the
content of a text file (see for example Terminal 5 on page 30). We have also
learned about redirecting the output of command into a file by using >. Now,
when you use cat without specifying a filename you redirect the standard
input (the keyboard) to the standard output (the screen). Let us try it out.
Terminal 20: cat
1 $ cat
2 this is text, now press Enter
3 this is text, now press Enter
4 now press Ctrl-Dnow press Ctrl-D now press Enter AND Ctrl-D
5 now press Enter AND Ctrl-D
6 $
With “cat > genomes.txt” you can now create a text file named genomes.txt,
which we are going to use and play
around with. Remember: When you mistype something you can go back with
the BkSp key – but only in the active line!
Now let us see what we can do with the text file. We have already learned
something about sorting the content of a text file in Terminal 7 on page 31.
Let us take a look at it again.
With sort you can, as the name implies, sort the content of a text file. This
can be very helpful in order to make files more readable. The sort command
comes with some useful options, which are, as usual, preceded by a dash.
7 4 lines text?
8 A. thaliana (plant) - 100,000,000 bp - 25,000 genes
9 E. coli (bacteria) - 4,670,000 bp - 3237 genes
10 H. sapiens (human) - 3,400,000,000 bp - 30,000 genes
11 S. cerevisiae (yeast) - 12,100,000 bp - 6034 genes
12 $ sort -u genomes.txt | sort -n
13 A. thaliana (plant) - 100,000,000 bp - 25,000 genes
14 E. coli (bacteria) - 4,670,000 bp - 3237 genes
15 H. sapiens (human) - 3,400,000,000 bp - 30,000 genes
16 S. cerevisiae (yeast) - 12,100,000 bp - 6034 genes
17 4 lines text?
18 203 characters?
19 $ sort -u genomes.txt | sort -n > sorted-genomes.txt
20 $
Terminal 25: wc
1 $ wc genomes.txt
2 7 44 247 genomes.txt
3 $
As we can see from the output in Terminal 25, the file genomes.txt consists
of 7 lines, 44 words and 247 characters.
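With the options -l (lines), -w (words) and -c (characters) you can restrict
the output of wc to a single number, which is particularly handy in pipes.
Based on the numbers in Terminal 25, the following sketch would print 7
and 44, respectively:

$ wc -l genomes.txt        # count the lines only
7 genomes.txt
$ cat genomes.txt | wc -w  # count the words of the standard input
44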
grep really is much more powerful than you can imagine right now! It
comes with a whole bunch of options, only one of which was applied in Ter-
minal 26. In line 1 we search for the occurrences of the word human in the file
genomes.txt. The matching line is printed out. In line 3 we search for the word
genes and pipe the matching lines to the program wc to count the number of
words (option -w) of these lines. The result is 36. Next, in line 5, we count the
number of lines (wc option -l) in which the word genes occurs; but we can
obtain this result much more easily by using the grep option -c (count). With
this option grep displays only the number of lines in which the query expres-
sion occurs. How can we search for two consecutive words? This is shown in
line 12. Notice that we enclose the words we are searching for in single quotes
(you could also use double quotes); and how about querying for two words
that are not necessarily in one line? Use the combination “\|” between the
words as shown in line 13 of Terminal 26. This stands for the logical or. Be
careful not to include spaces!
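If you want to retrace the queries discussed above, they might look like
this (a sketch; the exact matches depend on your version of genomes.txt):

$ grep human genomes.txt             # print all lines containing "human"
$ grep genes genomes.txt | wc -w     # count the words of the matching lines
$ grep genes genomes.txt | wc -l     # count the matching lines ...
$ grep -c genes genomes.txt          # ... or let grep do the counting
$ grep '30,000 genes' genomes.txt    # two consecutive words: quote them
$ grep 'human\|yeast' genomes.txt    # logical OR: either word matches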
At this stage you should already have a feeling about the power of the Linux
command line. We will get back to grep when we talk about regular expres-
sions in Chapter 9 on page 127.
Two interesting commands to compare the contents of text files are diff and
comm. To see how these commands work, create two text files with a list of
words, as shown in the following Terminal.
Terminal 27: diff
1 $ cat>amino1
2 These amino acids are polar:
3 Serine
4 Tyrosine
5 Arginine
6 $ cat>amino2
7 These amino acids are polar and charged:
8 Lysine
9 Arginine
10 $ diff amino1 amino2
11 1,3c1,2
12 < These amino acids are polar:
13 < Serine
14 < Tyrosine
15 ---
16 > These amino acids are polar and charged:
17 > Lysine
18 $ diff -u amino1 amino2
19 --- amino1 2003-05-11 18:42:53.000000000 +0200
20 +++ amino2 2003-05-11 18:43:30.000000000 +0200
21 @@ -1,4 +1,3 @@
22 -These amino acids are polar:
23 -Serine
24 -Tyrosine
25 +These amino acids are polar and charged:
26 +Lysine
27 Arginine
28 $ diff -c amino1 amino2
29 *** amino1 2003-05-11 18:42:53.000000000 +0200
30 --- amino2 2003-05-11 18:43:30.000000000 +0200
31 ***************
32 *** 1,4 ****
33 ! These amino acids are polar:
34 ! Serine
35 ! Tyrosine
36 Arginine
37 --- 1,3 ----
The result of the command diff (difference) indicates what you have to
do with the file amino1 to convert it to amino2. You have to delete the lines
marked with “<” and add the lines marked with “>”. With the option -u
(line 18) the same changes are shown in the unified format. With the option -c
(line 28) the context format is used, in which changed lines are marked by a “!”.
The command comm (compare) requires that the input files are sorted. We do
this in line 1 in Terminal 28. Note: We can write several commands in one
line, separating them with the semicolon character (;).
The comm command prints out its result in three columns. Column one
contains all lines that appear only in the first file (amino1s), column two
shows all lines that are present only in the second file (amino2s) and the
third column contains all lines that appear in both files. You can restrict the
output with the options -n, n being one or more column numbers, which are
not to be printed. Line 9 in Terminal 28 displays only the content of the third
column (“minus 1 and minus 2”), which contains the common file content.
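A sketch of such a session, assuming the files amino1 and amino2 from
Terminal 27:

$ sort amino1 > amino1s; sort amino2 > amino2s   # comm requires sorted input
$ comm amino1s amino2s       # three columns: only in amino1s, only in amino2s, both
$ comm -12 amino1s amino2s   # suppress columns 1 and 2: common lines only
Arginine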
6.2 pico
Probably the easiest-to-handle text editor is called pico (pine composer). It
has the same look and feel as an old DOS editor (see Fig. 6.1 on the next
page). However, pico is not installed on all Unix/Linux systems. Thus, you
should make the effort to learn vi (see Sect. 6.3 on page 72), which is a bit
harder but more universal and which offers you many more possibilities (in
fact, you need to memorize only 6 commands in order to work with vi). The
pico editor was developed by the University of Washington. You start pico
with the command pico and, optionally, with a filename. If the file already
exists, its content is displayed and you can edit it right away.
Editing commands are given to pico by typing special control key se-
quences. A circumflex, ^, is used to denote the Ctrl key. Thus, “^X Exit”
translates to: press Ctrl+X to exit the program. You can press Ctrl+G to
get a help text. This help text gives you more information about all available
commands.
If you cannot wait to use vi and make your first text file: here you go. Type
vi to start the editor. Press i to start the insertion mode and enter your text.
You start a new line by hitting Enter. With Esc you leave the insertion mode.
With :wq filename you quit vi and save your text in a file called filename,
whereas with :q! you quit without saving. That’s it. vi can be simple!
6.3.2 Starting vi
Now let us get serious. To start vi, enter: vi filename , where filename is
the name of the file you want to edit. If the file does not exist, vi will create
it for you. You can also start vi without giving any filename. In this case, vi
will ask for one when you quit or save your work. After you have called vi, the
screen clears and displays the content of the file filename. If it is a new file,
it does not contain any text. Then vi uses the tilde character (∼) to indicate
lines on the screen beyond the end of the file. vi uses a cursor to indicate
where your next command or text insertion will take effect. The cursor is the
small rectangle, which is the size of one character, and the character inside
the rectangle is called the current character. At the bottom of the window, vi
maintains an announcement line, called the mode line. The mode line lists the
current line of the file, the filename and its status. Let us now start vi with
the new file text.txt. The screen should then look like Terminal 29. Please note
that I have deleted some empty lines in order to save rain forest, i.e. paper.
The cursor is represented by [].
Terminal 29: vi
1 []
2 ~
3 ~
4 ~
5 ~
6 ~
7 ~
8 ~
9 ~
10 "text.txt" [New File] 0,0-1 All
6.3.3 Modes
Line 10 in Terminal 29 shows the mode line. At this stage you cannot enter
any text because vi runs currently in the command mode. In order to enter
text, the input mode must be activated. To switch from the command mode
to the input mode, press the i key (you do not need to press Enter). vi lets
you insert text beginning at the current cursor location. To switch back to
command mode, press Esc. You can also use Esc to cancel an unfinished
command in command mode. If you are uncertain about the current mode,
you can press Esc a few times. When vi beeps, you have returned to the
command mode. Okay, let us change to the input mode and enter some text.
Terminal 30: vi
1 This is new text in line 1. Now I press ENTER
2 and end up in the second line. I could also write to the e
3 nd of the line. The text will be wrapped automatically[]
4 ~
5 ~
6 ~
7 ~
8 ~
9 ~
10 -- insert -- 3,54 All
You can see some changes in the mode line in Terminal 30. “-- insert
--” indicates that you are in the input mode. Furthermore, the current cur-
sor position (line 3, column 54) is indicated. Now press Esc to get back into
the command mode; “-- insert --” will disappear. Now let us save the file:
type :w and then press Enter. The mode line will display a message as shown in
Terminal 31. If “:w” appears in your text you are still in the input mode!
Terminal 31: vi
1 This is new text in line 1. Now I press ENTER
2 and end up in the second line. I could also write to the e
3 nd of the line. The text will be wrapped automatically
4 ~
5 ~
6 ~
7 ~
8 ~
9 ~
10 "text.txt" [new] 3L, 160C written 3,54 All
Commands are very often preceded by the colon “:” character. Let us
try another command: type :set number. Now you see line numbers in front
of each line. Another command: type :r !ls. Woop. After hitting Enter you
have a list of all files in your current working directory imported into your text
file. That is magic, isn’t it? vi ran the shell command ls and imported the
result into the text file at the current cursor position.
You have now learned some powerful tools for moving around in a file. You
should memorize only the basic movements and know where to look up the
others (you can download a cheat sheet at www.kcomputing.com/vi.html ).
Maybe the most relaxing thing to know is that you can always undo changes
by typing u. With vim you can even undo many commands whereas vi will
recover only the last text change. If you are lucky, you can use the keys BkSp
and Del in order to make deletions in the input mode. Otherwise you must
make use of the commands in the command mode. To do so, first move the
cursor so that it covers the first character of the group you want to delete,
then type the desired command from the list below.
Notice that the second letter of the command specifies the same abbre-
viations as the cursor movement commands do. In fact, you can use delete
with all of the cursor movement specifiers listed above, e.g. dH would delete
everything from the current line to the top line of the screen.
In other cases you will need only to replace a single character or word, rather
than deleting it. vi has change and replace functions, too. First move to the
position where the change should begin (the desired line or the beginning of
the desired word). Next, type the proper command from the list below. Fi-
nally, enter the correct text, usually concluded with Esc (except for r).
cw Change a word.
C Overwrite to the end of the line.
r Replace a single character with another one. No Esc nec-
essary.
R Overwrite characters starting from the current cursor posi-
tion.
s Substitute one or more characters for a single character.
S Substitute the current line with a new one.
:r file Insert an external file at the current cursor position.
The change command c works like the delete command; you can use the
text portion specifiers listed in the cursor movement list.
vi provides several means of saving your changes. Besides saving your work
before quitting, it is also a good idea to save your work periodically. Power
failures or system crashes can cause you to lose work. From the command
mode, you type :w (write) and hit Enter. In order to save the text in a new
file, type :w filename . You quit vi with :q. You can save and quit at once
with :x or :wq. If you do not want to save your changes you must force
quitting with :q!. Be cautious when abandoning vi in this manner because
any changes you have made will be permanently lost.
Up to this point you have learned more than enough commands to use vi in a
comfortable way. The next two sections explain some more advanced features
which you might wish to use.
Frequently, you will need to cut or copy some text, and paste it elsewhere into
your document. Things are easy if you can work with the mouse. When you
mark some text with the mouse (holding the left mouse button) the marked
text is in the memory (buffer ). Pressing the right mouse button (or, on some
systems, the left and right or the middle mouse buttons) pastes the text at the
current cursor position. You can apply the same mechanism in the terminal
window!
Things are a bit more complicated if you have only the keyboard. First you cut
or copy the text into temporary storage, then you paste it into a new loca-
tion. Cutting means removing text from the document and storing it, while
copying means placing a duplicate of the text in storage. Finally, pasting just
puts the stored text in the desired location. vi uses a buffer to store the tem-
porary text. There are nine numbered buffers in addition to an undo buffer.
The undo buffer contains the most recent delete. Usually buffer 1 contains
the most recent delete, buffer 2 the next most recent and so forth. Deletions
older than 9 disappear. However, vi also has 26 named buffers (a-z). These
buffers are useful for storing blocks of text for later retrieval. The content of a
buffer does not change until you put different text into it. Unless you change
the contents of a named buffer, it holds its last text until you quit. vi does
not save your buffers when you quit.
The simplest way to copy or move text is by entering the source line numbers
and the destination line numbers. The m command moves (cuts and pastes)
a range of text, and the t command transfers (copies and pastes) text. The
commands have the syntax :x,y m z and :x,y t z, respectively, where x and y
delimit the range of lines and z is the line after which the text is inserted.
Another way is to use markers. You can mark lines with a letter from a
to z. These markers behave like invisible bookmarks. To set a mark you use
mx, with x being a letter from a to z. You can jump to a mark with ’x. The
following list shows you how to apply bookmarks to copy or move text. Note:
Bookmarks and line numbers can be mixed.
One last method uses the commands d (delete) or y (yank). With this
method you can make use of different buffers. Go to the line you wish to copy
or cut and press yy (yank) or dd (delete), respectively. Then move the cursor
to the line behind which you want to insert the text and type p (paste). In
order to copy a line into a buffer type "x yy, with x being a letter from a-z. You
insert the buffer with "x p. To copy more than one line precede the command
yy with the number of lines: 3yy copies 3 lines and 3yw copies 3 words. You
see, vi is very flexible and you can combine many commands. If you are going
to work a lot with it you should find out for yourself which commands you
prefer.
Finally, let us talk about another common issue: searching and replacing text.
As files become longer, you may need assistance in locating a particular in-
stance of text. vi has several search and search-and-replace features. vi can
search the entire file for a given string of text. A string is a sequence of char-
acters. vi searches forward with the slash (/) or backward with the question
mark key (?). You execute the search by typing the command, then the string
followed by Enter. To cancel the search, press Esc instead of Enter. You can
search again by typing n (forward) or N (backward). Also, when vi reaches
the end of the text, it continues searching from the beginning. This feature
is called wrapscan. Of course, you can use wildcards or regular expressions in
your search. We will learn more about this later in Section 9.3 on page 139.
Let us take a look at the most important search-and-replace commands:
:s/old/new/ replaces the first occurrence of old in the current line,
:s/old/new/g replaces all occurrences in the current line and :%s/old/new/g
replaces all occurrences in the whole file.
If you have followed this section about vi up to this stage, you should have
obtained a very good overview of its capabilities. It can do more and it offers
a whole range of options that one could set to personal preferences. However,
since I do not believe that anyone is really going deeper into this, I stop at
this point. You are welcome to read some more lines about vi. There are even
whole books dedicated to its application [6, 9, 11].
Exercises
Now sit down and play around with some text. This is elementary!
6.1. Create a text file named fruits.txt using cat. Enter some fruits, one in
each line. Append some fruits to this file.
6.2. Create a second text file named vegetable containing a list of vegetables,
again, one item per line. Now concatenate fruits and vegetables onto the screen
and into a file named dinner.
6.4. Take some time and exercise with vi. Open a text document with vi or
type some text and go through the description in this section. You must know
the basic commands in order to write and edit text files!
7
Using the Shell
The power of Linux lies in its shell. Of course, it is nice and comfortable to
run and control programs with a graphical user interface and the mouse. The
maximum flexibility, however, is gained from the shell. In this chapter you
will learn some basic features of the shell environment.
Until now we have always talked about the shell. However, there are many
shells around. We will mainly work with the bash shell. I wrote bash like a
command, because it is a command. Whenever you log into a system you can
type bash. This opens a new bash shell for you. You can exit the shell with
exit. When you type exit in your login shell (the shell you are in after you
have logged in to the system) you actually logout! When you try to logout
from a shell other than your login shell you will see an error message.
Terminal 32 and Fig. 7.1 illustrate how different shell levels are opened.
All these shells are called interactive shells because they wait for your input.
Fig. 7.1. From the login shell (dark grey) one can open new shells (light grey). In
order to logout again one has to “go back” to the login shell by exiting all opened
shells
In Terminal 32 the user Freddy logs in to the system and ends up in the
login shell in line 4. Now, Freddy opens a new bash shell with the command
bash. If he tries to logout from this shell, Freddy sees an error message (line
6). However, he can exit the shell with the exit command and gets back to
his login shell in line 9. From here, Freddy could logout with the command
logout. Now you see the difference between logout and exit.
The bash shell is one of the modern shells. The name is an acronym, standing
for Bourne-Again Shell. The first shell was the Bourne shell developed in 1978
by Steve Bourne from the Bell Laboratories. All other shells were developed
later. Nowadays, you will find on most systems the Bourne Shell (sh), C Shell
(csh) and Bash Shell (bash). You can see in Terminal 32 on the facing page
how to enter these shells with their respective command. If a shell is not
installed on the system, you will see an error message as in line 13. In line 12
Freddy tried to open the Korn Shell (ksh) but found that it is not installed on
his system. Terminal 32 on the preceding page also displays differences of the
command line prompt (that is something like “[Freddy@rware2 Freddy]$”).
In all other terminals we omitted this part and you will just see the $ character
(take a look at Sect. A.3.2 on page 267 for an example of how to change the
shell prompt). Here, the bash prompt says that the user Freddy is logged in
to the system rware2 and is currently in his home directory called Freddy.
In contrast, the Bourne shell prompt in line 10 shows only the version of the
shell, and the C shell prompt in line 15 looks similar to the bash prompt.
All these prompts can be set up according to your personal preferences. If
you do not know in which shell you are, type echo $0 as shown in line 17 of
Terminal 32 on the preceding page. This command is available in all shells.
When you login to your system you can use the command echo $0 to check
out your default shell. You can easily change your default shell with the com-
mand chsh (change shell).
In Terminal 33 on the preceding page we see how to find out which shells
are installed on your system. There is a list available in the file /etc/shells.
On my system, I can choose between 8 different shells. In order to change the
login shell, type chsh. Now you have to enter your password and give the path
to your desired shell. The new shell will be activated after a new login.
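A hypothetical dialogue (the exact wording differs between systems):

$ cat /etc/shells      # list the shells installed on the system
/bin/sh
/bin/bash
/bin/csh
$ chsh                 # change the login shell interactively
Changing shell for Freddy.
Password:
New shell [/bin/bash]: /bin/csh
Shell changed.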
A convenient way to search your command history for a keyword is
history|grep -i keyword
This will read the history file and perform a case-insensitive search on the
keyword you are looking for.
Terminal 34: Query Command History
1 $ history|grep -i sort
2 140 man sort
3 141 info sort
4 143 info sort
5 816 sort amino1>amino1s; sort amino2>amino2s
6 1060 ls -l |sort +4
7 1066 ls -l |sort +4 -nr
8 1077 alias llss="lls|sort +4 -n"
9 1085 history|grep -i sort
10 $ !1060
11 ls -l |sort +4 -nr
12 -rwxrw-r-- 1 Freddy Freddy 321 Mai 17 11:54 spaces.sh
13 -rw-rw-r-- 1 Freddy Freddy 59 Mai 17 12:40 err.txt
14 -rw-rw-r-- 1 Freddy Freddy 11 Mai 17 13:51 list.txt
Terminal 34 demonstrates its use. In line 1 we search for all past commands
in the history file that used the command sort. There are 8 hits, the most
recent one being at the bottom in line 9. Each command is preceded by an
unambiguous identifier. We can use this identifier to execute the command.
This is done in line 10. Note that you need to precede the identifier with an
exclamation mark (!). This is cool, isn’t it?
The following list shows you a couple of shortcuts that work in the bash shell:
↑ and ↓             Scroll up and down in the command history
PgUp and PgDn       Jump to the start or end of the command history
← and →             Move back and forth in the command line
Ctrl+B and Ctrl+F   As ← and →
Alt+B and Alt+F     Move back and forth word-wise
Home and End        Move to the beginning or end of the command line
Ctrl+A and Ctrl+E   As Home and End
Ctrl+L              Clear the screen
Ctrl+R              Query the command history
Tab                 Expand commands or filenames
The last shortcut in this list deserves our special attention. After you have
entered the first characters of a command, you can press the tabulator key
(Tab). If the command is already unambiguously identified by the characters
you entered, the missing characters will be appended automatically. Otherwise
you hear a system beep. When you press the tabulator key twice, all possible
command names will be displayed. The same works with filenames. Try it
out! This is a really comfortable feature!
7.5 Redirections
We have already used redirections in previous sections. The main purpose
was to save the output of a command into a file. Now let us take a closer
look at redirections. In Unix/Linux, there are three so-called file descriptors:
standard input (stdin), standard output (stdout) and standard error (stderr).
By default, all programs and commands read their input data from the stan-
dard input. This is usually the keyboard. When you enter text using the text
editor vi, the text comes from the keyboard. The data output of all programs
is sent to the standard output, which by default is the screen. When you use
the command ls, the result is printed onto the screen (or the active terminal
when you use X-Windows). The standard error is displayed on the screen,
too. If you try to execute the non-existing command lss, you will see an error
message on the screen.
It is often convenient to be able to handle error messages and standard out-
put separately. If you do not do anything special, programs will read standard
input from your keyboard, and they will send standard output and standard
error to your terminal’s display. The shell allows you to redirect the standard
input, output and error. Basically, you can redirect stdout to a file, stderr to
a file, stdout to stderr, stderr to stdout, stderr and stdout to a file, stderr and
stdout to stdout, stderr and stdout to stderr.
As already mentioned above, standard input normally comes from your key-
board. Many programs ignore stdin. Instead, you enter e.g. filenames together
with the command. For instance, the command cat filename never reads
the standard input; it reads the file directly. However, if no filename is
given together with the command, Unix/Linux commands read the required
input from stdin. Do you remember? We took advantage of this trick in Sec-
tion 6.1 on page 64. In Table 7.1 you find a summary of the syntax for redi-
rections. You should be aware that different shells use different syntax!
Table 7.1. Common Standard Input/Output Redirections for the C Shell and
Bourne Shell. The numbers “1” and “2” stand for the standard output and the
standard error, respectively
Function                                  csh              sh
Send stdout to file                       prog > file      prog > file
Send stderr to file                       –                prog 2> file
Send stdout and stderr to file            prog >& file     prog > file 2>&1
Take stdin from file                      prog < file      prog < file
Append stdout to end of file              prog >> file     prog >> file
Append stderr to end of file              –                prog 2>> file
Append stdout and stderr to end of file   prog >>& file    prog >> file 2>&1
Now let us take a look at some examples from the bash shell. In the fol-
lowing list, the numbers “1” and “2” stand for the standard output and the
standard error, respectively. Keep this in mind. This will help you in under-
standing the following examples.
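Here is a small selection of such examples (the filenames are arbitrary; lss is a non-existing command and thus produces an error message):

ls > list.txt          # send stdout of ls to the file list.txt
lss 2> err.txt         # send stderr to the file err.txt
ls > all.txt 2>&1      # send stdout and stderr to the file all.txt
ls >> list.txt         # append stdout to the end of the file list.txt
lss 2>&1               # send stderr to stdout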
You see, there are many ways to redirect the input and output of programs.
Let us finally consider that you wish to redirect the output of a command into
a file and to display it on the screen. Do you have to run the command twice?
Not with Unix and Linux! You can use the command tee. With
ls | tee filename
you see the content of the current directory on the screen and save it in a
file named filename. With the option -a, the result is appended to the file
filename.
7.6 Pipes
Pipes let you use the output of one program as the input of another one. For
example you can combine
ls -l | sort
to sort the content of the current directory. We will use pipes later in con-
junction with sed. For example, in
ls -l | sed s/txt/text/g
the command ls -l is executed and sends (pipes) its output to the sed program
that substitutes all occurrences of txt with text. Another common way to use
a pipe is in connection with the grep command (see Sect. 6.1.5 on page 68).
With the line
cat publications | grep AIDS
all lines of the file called publications that contain the word AIDS will be
displayed. Another great pipe you should remember is the combination with
the command less (see Sect. 6.1.3 on page 67). This command displays large
texts in a rather nice way: you can scroll through the output. You get back to
the command line after hitting Q. Imagine you have a large number of files
in your home directory and run ls -l. Most probably, the output will not fit
onto the screen. This means that you miss the beginning of the text. In such
cases you should use
ls -l | less
in order to be able to scroll through all the output. The following command
offers an interesting combination and a good example of the power of redirec-
tions and pipes:
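Such a command, matching the description below, could be:

ls -l | tee dir-content | sort +4 > dir-content-size-sorted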
The file dir-content will contain the content of the current directory. The
screen output is piped to sort and sorted according to file size (the option +4
makes sort skip the first four fields, i.e. sort on the file-size field). The result
is saved in the file dir-content-size-sorted.
command1 && command2

Command2 is executed if, and only if, command1 returns a zero exit status
(that is, no error).

command1 || command2

Command2 is executed if, and only if, command1 returns a non-zero exit status
(that is, an error). Let us take a look at one example.
Terminal 35: Command List
1 $ find ~ -name "seq" && date
2 /home/Freddy/seq
3 /home/Freddy/neu/seq
4 /home/Freddy/neu/temp/seq
5 Mit Mai 28 21:49:25 CEST 2003
6 $ find ~ -name "seq" || date
7 /home/Freddy/seq
8 /home/Freddy/neu/seq
9 /home/Freddy/neu/temp/seq
10 $
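In line 1 of Terminal 35, find finishes successfully (exit status 0); therefore, the date command after && is executed in line 5. In line 6, find again finishes successfully; consequently, the date command after || is not executed.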
7.8 Aliases
A very nice feature is the alias function. It allows you to create shortcuts
(aliases) for every command or combination of commands. Let us assume you
have pretty full directories. The command ls -l fills up the screen and you
even lose a lot of lines because the screen can display only a limited number
of lines. Thus, you wish to pipe the output of ls -l to the command less.
Instead of typing every time ls -l | less you can create an alias. The alias
is created with the command alias:
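alias lls="ls -l | less"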
When you now type lls the command ls -l | less is executed. You can
also use your alias in a new alias:
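alias llss="lls | sort +4 -nr"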
sorts the output according to file sizes, beginning with the largest file. If you
want to see a list of all your active aliases type alias. To remove an alias use
the command unalias, like unalias llss.
Aliases are not saved permanently if you do not explicitly tell the system to
do so. Furthermore, they are valid only in the active shell. If you open a new
shell, you have no access to your aliases. In order to save your alias you must
place an entry into the file .bashrc in your home directory (use the editor vi).
Since different systems behave differently, you are on the safe side by putting
the same entry into the file .bash_profile in your home directory (see Sect. A.3.1 on
page 266).
Much more powerful than aliases are shell scripts. With shell scripts you can
also use parameters and perform much more sophisticated tasks. You will
learn more about shell scripts in Chapter 8 on page 97.
7.9 Scheduling Commands

Unix/Linux systems permanently run a number of programs in the background.
These background system programs are called daemons. Among these dae-
mons is cron. Cron checks a system table for entries. These entries tell
cron when to do what. Note: on most systems, you must get permission from
the system administrator before you can submit job requests to cron. In or-
der to fill the table with commands that need to be executed repeatedly (e.g.
hourly, daily or weekly), there is a program called crontab. The crontab
command creates a crontab file containing commands and instructions for the
cron daemon to execute. For editing the crontab file, crontab uses the text
editor vi (see Sect. 6.3 on page 72).
You can use the crontab command with a number of options. The most im-
portant ones are:
crontab -e Edit your crontab file. If not existent, create the crontab
file.
crontab -l Display your crontab file.
crontab -r Remove your crontab file.
Each entry in a crontab file consists of six fields, specifying in the following
order: minute, hour, day, month, weekday and command(s). The fields are
separated by spaces or tabs. The first five fields are integer patterns and the
sixth is the command to execute. Table 7.2 briefly describes each of the fields.
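In standard cron syntax, the fields take the following values:

minute      0–59
hour        0–23
day         1–31
month       1–12
weekday     0–6 (0 = Sunday)
command(s)  the command line to be executed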
Each of the values from the first five fields in Table 7.2 may be either
an asterisk (*), meaning all legal values, or a list of elements separated by
commas. An element is either a number or an inclusive range, indicated by
two numbers separated by a dash (e.g. 10–12). You can specify days with two
fields: day of the month and day of the week. If you specify both of them as
a list of elements, cron will observe both of them. For example
0 0 1,15 * 1 /mydir/myprogram
would run the program myprogram in the mydir directory at midnight on the
1st and 15th of each month and on every Monday. To specify days by only
one field, the other field should be set to *. For example, with
0 0 * * 1 /mydir/myprogram
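the program myprogram would be run at midnight on Mondays only.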
7.10 Wildcards
You do not have to be a programmer to appreciate the value of the shell’s
wildcards. These wildcards make it much easier to search for files or content.
When the shell expands a wildcard, it is replaced with all the matching pos-
sibilities. Let us take a look at the most common wildcards:
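*       matches any number of characters, including none
?       matches exactly one arbitrary character
[...]   matches exactly one of the characters enclosed in the brackets; ranges like [0-9] are allowed
[!...]  matches exactly one character that is not enclosed in the brackets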
These are the most important wildcards. Let us see how we can use them
in conjunction with the command ls.
Terminal 36: Wildcards
1 $ ls
2 1file cat list.sh new seq2.txt spaces.sh
3 AFILE err.txt list.txt seq1.txt seq5.txt end.txt
4 $ ls [n]*
5 new
6 $ ls seq[0-4].txt
7 seq1.txt seq2.txt
8 $ ls [A-Z]*
9 AFILE
10 $ ls *[!a-z]*
11 1file err.txt list.txt seq2.txt spaces.sh
12 AFILE list.sh seq1.txt seq5.txt end.txt
13 $
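In line 4 we list only files beginning with the character n. In line 6 we list the files named seq followed by one of the digits 0 to 4 and the extension .txt. The pattern in line 8 matches all files beginning with an uppercase letter, whereas the pattern in line 10 matches all files whose name contains at least one character that is not a lowercase letter.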
7.11 Processes
In Linux, a running program or command is called a process. Ultimately, ev-
erything that requires the function of the processor is a process. The shell
itself is a program, too, and thus a process. Whenever one program starts
another program, a new process is initiated. We are then talking about the
parent process and the child process, respectively. In the case that a command
is executed from the shell, the parent process is the shell, whereas the child
process is the command you execute from the shell. That in turn means that,
when you execute a command, there are two processes running in parallel:
the parent process and the child process. As we heard before in Chapter 2 on
page 9, this is exactly one strength of Unix/Linux. The capacity of an op-
erating system, such as Unix or Linux, to run processes in parallel is called
multitasking. In fact, with Unix or Linux you can run dozens of processes in
parallel. The operating system takes care of distributing the calculation ca-
pacity of the processor to the processes. This is like having a nice toy in a
family with several children. There is only one toy (processor). The mother
(operating system) has to take care that every kid (process) can play with the
toy for a certain time before it is passed to the next child. In a computer this
“passing the toy” takes place within milliseconds.
As a consequence of multitasking, it appears to the user as if all processes run
in parallel. Even though you might run only a few processes, the computer
might run hundreds of processes in the background. Many processes are run-
ning that belong to the operating system itself. In addition, there might be
other users logged in to the system you are working on. In order to identify
each process unambiguously, all processes are numbered by the operating system. Each
process has its own unique process number.
Terminal 37, lines 1 to 4 show the output of the ps command (print pro-
cess status). The first line is the header describing what the row entries mean:
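Judging from Terminal 40 below, these entries are PID (the unique process ID), TTY (the terminal the process is attached to), TIME (the processor time the process has consumed so far) and CMD (the name of the command).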
As usual, there are a lot more options possible. Quite useful is the option
-T. It shows you all processes of the terminal in which you are currently work-
ing, together with the process state. However, the most important information
for you is the process ID and the process name. In the example in Terminal 37 there
are two processes running. The first process is the bash shell and the second
process the command ps itself. They have the process IDs 958 and 1072, re-
spectively. The output of ps -T is shown in lines 6 to 8. Of course, the shell
is still running together with ps -T. The process’s state information can be
interesting to identify a stopped (halted) process.
There are two basic ways of running processes: either in the foreground or
in the background. In foreground mode, the parent process (usually the shell)
is suspended while the child (the command or script) process is running and
taking over the keyboard and monitor. After the child process has terminated,
the parent process resumes from where it was suspended.
In background mode, the parent process continues to run using the keyboard
and monitor as before, while the child process runs in the background. In this
case it is advisable that the child process gets all its input from and sends all
its output to files, instead of the keyboard and monitor. Otherwise, it might
lead to confusion with the parent process’s input and output. When the child
process terminates, either normally or by user intervention, the event has no
effect on the parent process, though the user is informed by a message sent
to the display.
An example would be the command sort. For long text files, the execution of sort
can take quite a while. Thus, you might want to start it directly in the background.
That is done as shown in Terminal 38.
Terminal 38: Background Program Execution
1 $ sort largefile.txt > result.txt &
2 [3] 23001
3 ...
4 $ [3]- Done sort largefile.txt >result.txt
5 $
Executing any command with the ampersand (&) will start that command
in the background. The sort command is started in the background as shown
in line 1 of Terminal 38. You will then see a message giving the background
ID ([3]) and the process ID (23001). (On many systems you have to press
Enter before you see the message. This can be changed for the current session
by typing set -b). Now you can continue to work in the shell. When the
background process is finished you will see a message as shown in line 4.
You may also elect to place a process already running in the foreground into
the background and resume working in the shell. This is done by pressing
Ctrl+Z and then typing bg. This brings the process into the background.
In order to bring a background process back into the foreground, you have
to type fg processname or fg backgroundID. You can again send it back
into the background by stopping (not quitting) the process with Ctrl+Z and
then typing bg. Note that you quit a process with Ctrl+C.
Terminal 39: Background Processes
1 $ ps -x
2 PID TTY STAT TIME COMMAND
3 1050 pts/1 S 0:00 -bash
4 23036 pts/1 R 0:00 ps -x
5 $ sleep 90 &
6 [1] 23037
7 $ ps -x
8 PID TTY STAT TIME COMMAND
9 1050 pts/1 S 0:00 -bash
10 23037 pts/1 S 0:00 sleep 90
11 23038 pts/1 R 0:00 ps -x
12 $ fg sleep
13 sleep 90
14
A good command to play around with is sleep 90. This command just
waits 90 seconds. Of course, you can use any other desired number of seconds.
Terminal 39 shows an example. In line 1 we list all active processes. In line 5,
we execute sleep 90 as a background process. In line 12 we bring sleep to the
foreground with the command fg. Then, which is not visible in Terminal 39, we
press Ctrl+Z, thereby pausing (stopping) the process. With the bg command
we could then resume execution of sleep in the background.
You may actively terminate (kill) any process, provided you own it (you own
any process that you initiate plus all descendants (children) of that process).
You can also downgrade (but not upgrade!) the priority of any owned process.
Sometimes a process, like a program, hangs. This means you cannot access
it any more and it does not react to any key strokes. In that case you can
kill the process. This is done with the command kill PID, where PID is the
process ID you identify with ps. When you kill a process, you also kill all its
child processes. In order to kill a process you must either own it or be the
superuser of the system.
Terminal 40: Process Killing
1 $ sleep 60 &
2 [1] 1009
3 $ ps
4 PID TTY TIME CMD
5 958 pts/0 00:00:00 bash
6 1009 pts/0 00:00:00 sleep
7 1010 pts/0 00:00:00 ps
8 $ kill 1009
9 [1]+ Terminated sleep 60
10 $ ps
11 PID TTY TIME CMD
12 958 pts/0 00:00:00 bash
13 1011 pts/0 00:00:00 ps
14 $
Exercises
If you have read this chapter carefully, you should have no problems with the
two little exercises...
7.2. Print out a list of the active processes, sort the list by command names
and save it into a file.
8
Shell Programming
With this chapter we enter a new world. Until now you have learned the
basics of Unix/Linux. You have learned how to work with files and create
and edit text files. Now we are going to use Linux. We take advantage of its
power. Up to now everything was quite uncomfortable and I guess that you
thought occasionally: “Okay, it is free – but damn uncomfortable!” But now
you are a pro! You know what it is about, how to work on the system; now
it is time to take advantage of it, squeeze it out, form it, make it work
for you, harvest the fruits of learning – by more learning. In this section you
start programming! If you thought programming is something for the freaks –
forget it. Everybody can do it, but you must like solving problems. Hey, you
are a scientist! That is your profession! And it is creative. It is like art – you
will see! It is like solving crosswords: you need to take your time and passion
and probably need to look up one or the other thing; but finally you solved
it.
Throughout the last chapters we used the shell intensively. Now it is time to
take a closer look at its functions.
Suppose you save a script under the name of an existing command, for example
date, and call that very command inside the script. When you execute the
script, the shell may find your script first and, instead of the system command,
a new instance of the script is started. Now your script is running twice; the
creation of new instances goes on and on and on until your system is completely
busy with date and gives an error message. Most probably you will have to
restart your system (or the system administrator will have to do this). Always
use file extensions for script files.
What does the inside of a script file look like? Well, it is simply a file contain-
ing commands to be executed. In this section we will write shell script files;
however, later we will write sed, awk and perl script files. Remember: sed,
awk and perl are script languages.
In the first line of a shell script file you identify the program that is to in-
terpret the script. For example, in order to invoke the bash shell to interpret
the script, the first line would look like #!/bin/bash. This line consists of
the so-called shebang “#!” (from sharp and bang). In the next lines the script
follows. You can and should make use of the possibility to comment your pro-
gram. Otherwise, you will soon forget what your script does and why. Ergo:
always use comments! Comments begin with a hash (#). Everything after the
hash is ignored by the shell.
Okay, let us write our first script. Script files must be executable. I guess this
is clear?! A script is a program – and programs have to be executed. Our first
script (Program 2) will convert all files ending with .sh in the home directory
and all subdirectories into executable files.
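A sketch of Program 2 is reproduced below; the find commands follow from the description of lines 5, 7 and 9, whereas the exact wording of the echo messages is an assumption:

 Program 2: Convert Scripts to Executables
1 #!/bin/bash
2 # save as con-sh-exe.sh
3 # converts all *.sh files into executable files
4 echo "These files will be made executable:"
5 find ~ -name "*.sh"
6 echo "Converting..."
7 find ~ -name "*.sh" -exec chmod u+x {} \;
8 echo "The permissions are now:"
9 find ~ -name "*.sh" -exec ls -l {} \;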
Using the text editor vi, enter Program 2 and save it in a file called con-sh-exe.sh. Then make the file executable with chmod u+x con-sh-exe.sh and run it with

./con-sh-exe.sh
Now, let us go step by step through the program. Line 1 instructs the bash
shell to execute all the commands. In principle, we could also instruct another
shell or any other command interpreter like awk or perl. Lines 2 and 3 contain
a comment on the program. As said before, lines beginning with a hash (#)
are ignored by the command interpreter. The command echo in lines 4, 6 and
8 prints out a message to the standard output (stdout), i.e. the screen. The
message is placed between quotation marks (“”). You could also use echo
without anything. That would print a blank line. In line 5 the script runs
the command find. It searches in the home directory (the shortcut is ~; you
could also write /home/Freddy) for files having the extension .sh. Note that
the filename is enclosed in quotation marks. The output of the command is
printed, as usual, to the stdout. There is no difference whether you run the
command from the command line or a script. In line 7 the command find
is used again. Here the option -exec suppresses the normal output to stdout.
Instead, the files found are handed over to the program executed by the option
-exec. The files are represented by a pair of empty curly brackets ({ }). Here, the
program executed is chmod with the option u+x (see Sect. 4.4 on page 41). As
said, the curly brackets represent the files found by find, for which the
permission is to be changed by chmod. Note that the command is followed by
“\;”. This is obligatory. In line 9 the find command is used to list all the
files ending with .sh. This is done in order to visualize that the permissions
are set correctly. If it came down only to the main task – changing the permission of
all files ending with .sh to executable – the script could be reduced to lines
1 and 7. All the rest is luxury, but it helps to understand what is going on.
. ../.bash_profile
Another option is to write a script that adds the current working directory to
the PATH variable. Program 3 shows such a script.
Program 3: Add to Path
1 #!/bin/bash
2 # save as add2path.sh
3 # add the currently active directory to the path variable
4 # assignment only active in the current session
5 PATH=$PATH":"$(pwd)
8.3 Variables
A variable is a symbol that stands for some value – an abstraction for some-
thing more concrete. The shell’s ordinary variables are very simple. A variable
comes into existence when a value is assigned to it. Its value is what program-
mers call a string: simple text (it can be composed of numeric digits that
would represent a number to the human observer or to some other program,
but the shell itself is blind to their numeric value).
The assignment of a variable is a matter of only one line:
plant=orchid
There must be no spaces around the equal character. With this simple com-
mand the value “orchid” is assigned to the variable named plant. If you wish
to recall the content saved in plant, you call it with the echo command:
echo $plant
Note that you have to precede the variable name with the dollar character ($).
Terminal 42: Variables
1 $ plant=orchid
2 $ echo $plant
3 orchid
4 $ orchid=parasite
5 $ plant=$orchid
6 $ echo $plant
7 parasite
12 $
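A variable created in one shell is not automatically known in a subshell: it must be exported first. Terminal 43 demonstrates this; its first lines, inferred from the description below, would read:

 Terminal 43: Exporting Variables
1 $ plant=rose
2 $ sh
3 sh$ echo $plant
4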
5 sh$ exit
6 exit
7 $ export plant
8 $ sh
9 sh$ echo $plant
10 rose
11 sh$ exit
12 exit
13 $
In line 1 in Terminal 43 we create the variable plant and assign the value
“rose” to it. Then we change into the Bourne shell with the command sh.
Here, as depicted in line 3, we recall the value of plant. It is empty. Now we
exit the Bourne shell, export the variable plant with export plant and go
back into the Bourne shell. This time, as line 10 shows, the value “rose” is
known in the subshell.
Apart from the variables we create, there are a number of system variables
we can use. These variables are also called environmental variables or shell
variables. Environmental variables are available for the whole system and all
shells. Shell variables are available only in the current shell. In Terminal 43
we exported a shell variable to the environment. With the command env
(environment) you can list all available environmental variables. With the
set command you can list all shell variables. Some interesting environmental
variables are shown in Table 8.1.
The list shows the variable name (note that system variables are always
uppercase), its meaning and an example of its content.
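A selection (the example contents are, of course, system-dependent):

HOME    path to your home directory         /home/Freddy
USER    your login name                     Freddy
SHELL   path to your login shell            /bin/bash
PATH    directories searched for commands   /usr/local/bin:/usr/bin:/bin
PWD     your current working directory      /home/Freddy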
In order to delete a variable, the command unset can be used. Then the vari-
able is totally gone and not just empty.
Terminal 44: Variables
1 $ my_var=value
2 $ set|grep my_var
3 my_var=value
4 $ env|grep my_var
5 $ export my_var
6 $ env|grep my_var
7 my_var=value
8 $ unset my_var
9 $ env|grep my_var
10 $ set|grep my_var
11 $
Terminal 44 shows how you can easily search for
a variable in the shell or in the environment with the help of grep. In line 8 the
variable my_var is removed from the system.
8.4 Input and Output
8.4.1 echo
As we saw before, the echo command prints text or the values of variables
onto the screen. You should always enclose the text in double quotes, as
shown in Terminal 42 on page 101, line 8. echo offers the interesting option
-e, which makes echo evaluate escape sequences, that is, a backslash followed
by a certain character:
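\a   alert (the system bell)
\n   new line
\t   horizontal tabulator
\\   backslash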
When you type echo -e "\a" you will hear the system bell. This function
is quite nice to inform the user that the execution of a script has finished.
In a similar manner you can introduce tabulators or a new line. Thus, it is
possible to format the output a little bit.
Assume you want to write a couple of help lines in your script. Of course, you
could achieve this with the echo command. However, it looks quite messy.
For such cases the shell offers the << operator, called here document. The
operator is followed by any arbitrary string. All following text is regarded as
coming from the standard input, until the arbitrary string appears a second
time.
Program 4: Printing Text
1 #!/bin/bash
2 # save as text.sh
3 # printing text
4 cat <<%%
5 Here comes a lot of text.
6 My home directory is $HOME - cool.
7 Let us use some tabulators:
8 tab1
9 tab2
10 This looks like a nice list.
11
12 %%
13
14 cat <<\%%
15 Here comes a lot of text.
16 My home directory is $HOME - cool.
17 Let us use some tabulators:
18 tab1
19 tab2
20 This looks like a nice list.
21 %%
Program 4 illustrates the use of <<. In line 4, %% has been used as the text range
indicator. The cat command gets all the text up to the second occurrence of
%% in line 12. From line 14 to 21 we use the same construction. However, the
delimiter %% is now preceded by a backslash. Therefore the variable HOME
will not be expanded but printed as it is: “$HOME”.
Two commands are available to get interactive input into your shell script.
The most commonly used command is read. On some very old systems the
read command is not available. Then you have to use line instead.
Usually you would first ask the user for the required input. This can easily
be done with echo. Then you apply the command read in order to save the
user’s input in a variable.
Program 5: Change $PWD
1 #!/bin/bash
2 # save as chg-pwd.sh
3 # changes the environmental variable PWD
4 echo
5 echo "--------------------------------"
6 echo "PWD is currently set as $PWD"
7 echo
8 echo "Enter a new path and press ENTER"
9 echo "or enter nothing and press ENTER"
10 echo "to leave \$PWD unchanged"
11 read new
12 PWD=${new:-$PWD}
13 echo "PWD is now set to $PWD"
14 echo "--------------------------------"
15 echo -e "\a"
In order for the new value of PWD to survive the end of the script, you must run the script in the current shell with

. chg-pwd.sh
The dot stands for the shell command source. In fact, you could also run the
script with source chg-pwd.sh. The source command prevents the script
from running in a new subshell.
What makes scripts much more powerful than aliases is the possibility
to provide parameters. When a shell script is invoked, it receives a list of
parameters (also called arguments) from the command line that invoked it.
These parameters can then be used to direct the execution of the script. Let
us assume you have DNA sequences saved in many individual files. Now, you
want to display all files in your home directory and all subdirectories that
contain a certain DNA sequence. The shell command would be
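find ~ -type f -name "*.dna" -exec grep "TATAAT" {} \;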
assuming that you wish to query the sequence TATAAT in files ending with
.dna. It is quite a pain always to enter this line. You could write an alias
for this command; but do you always query for the same sequence? Certainly
not! Furthermore, it would be nice to search in other files, too. For example,
it is common to save amino acid sequences in files ending with .aa. A good
option is to write a shell script! Take a look at Program 6.
Program 6: Find Sequence
1 #!/bin/bash -x
2 # save as grep_file_seq.sh
3 # search for a pattern (par2) in *.(par1) files
4 # query home directory and all subdirectories
5 find $HOME -type f -name "*.$1" -exec grep "$2" {} \;
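The script is then executed with

./grep_file_seq.sh dna TATAAT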
This means you first enter the script name and then provide two parameters,
dna and TATAAT, respectively. These parameters are internally saved in the
variables $1 and $2, respectively. Line 1 of Program 6 indicates that it should
be interpreted by the bash shell. We also use the option -x. This is a good
option for debugging purposes, that is: testing the program. With the option
-x all executed commands will be displayed (see Sect. 8.8 on page 119). The
next 3 lines contain comments. They help us to remember what the script is
about. Line 5 contains the command itself. Here, you see the use of the system
variable HOME and the parameters $1 and $2, respectively. They are used
by the script as if they had been typed in. You can use up to 9 parameters. The
content of these parameters is explained in Table 8.2 on the following page.
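In essence, these are:

$0        the name of the script itself
$1 to $9  the first to the ninth command line parameter
$#        the number of command line parameters
$*        all command line parameters in one string
$?        the exit status of the last command
$$        the process ID of the current shell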
In Program 2 on page 98 we changed the permission of all files in the home
directory and all subdirectories having the extension *.sh. Now, let us write a
script that executes the same task, but only in the currently active directory
and only with files we specify. The corresponding script is shown in Program 7.
Program 7: Make Files Executable
1 #!/bin/bash
2 # save as chmod_files.sh
3 # adds execution permission for the user
4 chmod u+x $*
Program 7 gives you an example of how one can use the variable containing
all parameters ($*) in order to supply a command with all parameters at once.
There are many more possibilities, which you might want to look up in
the manpages. Let us take a look at some examples in the following Terminal.
Terminal 45: Parameter Substitution
1 $ enzyme="hydrogenase specific endopeptidase"
2 $ echo $enzyme
3 hydrogenase specific endopeptidase
4 $ echo ${enzyme:=text}
5 hydrogenase specific endopeptidase
6 $ echo ${enzyme:+text}
7 text
8 $ echo $gene
9
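10 $ echo ${gene:-gene not discovered}
11 gene not discovered
12 $ echo $gene
13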
14 $ echo ${gene:=ATG...TAA}
15 ATG...TAA
16 $ echo $gene
17 ATG...TAA
18 $ echo ${#gene}
19 9
20 $ echo ${gene:3:3}
21 ...
22 $ echo ${gene:3}
23 ...TAA
24 $
The examples given in Terminal 45 should give you an insight into func-
tional aspects of variable substitution. In line 1 we assign the value “hydro-
genase specific endopeptidase” to the variable named enzyme. In line 4, the
value of the variable enzyme would be replaced by the text string “text”, if
enzyme were empty, which is not the case. The value of enzyme is returned
and displayed in line 5. In line 6, the expansion returns the string “text”
because enzyme is not empty; the variable itself is not changed. In line 10
of Terminal 45 the text string “gene
not discovered” is returned. However, the empty variable gene remains un-
changed. In line 14 a value is assigned to the previously empty variable gene.
In line 18 we check for the size of the content of gene, whereas we extract
parts of gene in lines 20 and 22. In all these cases the value of gene itself
remains unchanged.
8.6 Quoting
The backslash (\) is bash’s escape character. It preserves the literal value of
the next character that follows, with the exception of the newline character.
Terminal 47: Escaping
1 $ echo "Research costs some $s"
2 Research costs some
3 $ echo "Research costs some \$s"
4 Research costs some $s
5 $
Enclosing characters in single quotes (’) preserves the literal value of all char-
acters within the quotes. Note: A single quote may not occur between single
quotes, even when preceded by a backslash!
Terminal 48: Quoting
1 $ echo ’Research costs some $s’
2 Research costs some $s
3 $ echo ’Research costs some \$s’
4 Research costs some \$s
5 $
The effect of single quotes is demonstrated in Terminal 48. Note that even
the backslash is recognized as literal and not interpreted as escape character.
Enclosing characters in double quotes (”) preserves the literal value of all
characters within the quotes, with the exception of the dollar character ($),
the grave character (`) and the backslash (\). The dollar character and the
grave character retain their special meaning within double quotes. The backslash
retains its special meaning only when it is followed by $, `, ", \ or the
newline character.
8.7 Decisions – Flow Control
8.7.1 if...then...elif...else...fi
Fig. 8.1. The if...then construct. If the expression (expr ) is true, then execute
the command(s) (cmds). Typical expressions are discussed in Section 8.7.2 on the
next page
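The general syntax is:

if expression1; then
action1
elif expression2; then
action2
else
action3
fi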
The elif and else commands are optional. The rest is obligatory. If ex-
pression1 returns 0 (i.e., true), then action1 is executed. If expression1 does
not return 0 (i.e. false), then expression2 is tested. If expression 2 returns 0,
then action2 is executed. Else, action3 is executed. This means that either
action1, action2 or action3 is executed. Of course, each action could consist
of several actions in separate lines.
Usually commands return the exit status 0 when they have finished their job
without any error. This property is used in Program 8.
Program 8: If...Then
1 #!/bin/bash
2 # save as if-ls.sh
3 # tests if file is present in current dir
4 # needs 1 parameter
5 if ls $1 >/dev/null 2>&1; then
6 echo "$1 exists"
7 else
8 echo "$1 does not exist"
9 fi
8.7.2 test
The test command, although not part of the shell, is intended for use by shell
programs. It can be used for comparisons. For example, “test -f file” returns
the exit status zero if file exists and a non-zero exit status otherwise.
The exit status of the test command (or any other command) can then be
analyzed and the behaviour of the program be adapted accordingly. In gen-
eral, test evaluates a predicate and returns the result as its exit status. Some
of the more frequently used test arguments are given below. Note that n1 and
n2 represent different numbers or variables containing numbers, and s1 and
s2 represent different text strings or variables containing text.
–Numbers–
n1 -eq n2 True if number n1 is equal to number n2.
n1 -le n2 True if number n1 is less than or equal to number n2.
n1 -ge n2 True if number n1 is greater than or equal to number n2.
n1 -lt n2 True if number n1 is less than number n2.
n1 -gt n2 True if number n1 is greater than number n2.
–Strings–
-n s1 True if string s1 is not empty.
-z s1 True if string s1 is empty.
s1 = s2 True if string s1 is equal to string s2.
s1 != s2 True if string s1 is not equal to string s2.
–Files–
-e file True if file exists.
-f file True if file is a file.
-d file True if file is a directory.
-r file True if file is readable.
-w file True if file is writable.
-x file True if file is executable.
-s file True if file is not empty.
With the help of echo you can analyze the result of the test command
in the command line.
Terminal 49: Test
1 $ test 2 -eq 2; echo $?
2 0
3 $ test 2 -eq 3; echo $?
4 1
5 $ test $USER; echo $?
6 0
7 $
The exit status of test is saved in the variable $?. Thus, as shown in
Terminal 49, you can check the exit status with “echo $?”. In order to write
two commands on one line we have to separate them with a semicolon. If the
comparison by test returns true (exit status 0), then $? is zero, and vice
versa. The following program gives an example of the combination of if...then
and test.
Program 9: Test Parameters
1 #!/bin/bash
2 # save as test-par.sh
3 # tests if the number of parameters is correct
4 if test $# -ne 1; then
5 echo "Program needs exactly 1 parameter"
6 echo "bye bye"
7 exit 1
8 fi
Instead of calling test explicitly, you can use the equivalent bracket notation:

if [ $# -ne 1 ]; then
8.7.3 while...do...done
Fig. 8.2. The while...do construct. While the expression (expr ) returns true, the
command(s) (cmds) are executed
The loop is aborted when the expression returns a non-zero exit status (i.e. the
expression becomes false). The loop is introduced with while and ends with
done.
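The syntax is:

while expression; do
action(s)
done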
Program 10 expects a DNA sequence from the command line. Thus, the
script is executed with
./triplet.sh atgctagtcgtagctagctcga
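The script itself is the while counterpart of Program 13 (see Sect. 8.7.4); reconstructed from that program and the description below, it reads:

 Program 10: Split Sequence into Triplets
1 #!/bin/bash
2 # save as triplet.sh
3 # splits a sequence into triplets
4 x=0
5 while [ -n "${1:$x:3}" ]; do
6 seq=$seq${1:$x:3}" "
7 x=$(expr $x + 3)
8 done
9 echo "$seq"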
The DNA sequence is then split into triplets and printed out. In line 4 we
assign the value 0 to the variable x. The important line is line 5. Here we
use the brackets to invoke the test command (see Sect. 8.7.2 on page 111).
We have learned that the option -n will return the exit status 0 when a
given string is not empty (see Sect. 8.7.2 on page 111). This means that the
expression
[ -n "${1:$x:3}" ]
is true, as long as triplets can be copied from the variable $1 (the command
line parameter). Remember that ${a:b:c} gives c characters from position
b of the variable a (see Sect. 8.5.1 on page 107). Thus, line 5 reads: while
$1:$x:3 is not empty, do execute the commands up to done. In line 6, a
triplet and a space character are added to the variable seq. Then, the counter
variable x is increased by 3 (3 nucleotides = one triplet). Finally, the sequence
is displayed.
Together with the command shift, the while loop can be used to read all com-
mand line parameters. With shift the command line parameters are shifted.
The parameter assigned to $9 is shifted to $8, $8 to $7 and so on. The pa-
rameter in variable $1 is lost. An appropriate way to read all command line
parameters would be:
Program 11: Read Command Line Parameters
1 #!/bin/bash
2 # save as para.sh
3 # prints command line parameters
4 while [ -n "$1" ]; do
5 echo "\$#=$# - \$0= $0 - \$1=$1"
6 shift
7 done
8.7.4 until...do...done
Very similar to the while loop is the until loop. In fact, it is the negation of
the former (see Fig. 8.3 on the following page).
An action is not executed while a certain condition is given but until a
certain condition is given. The syntax is:
until expression; do
action(s)
done
Fig. 8.3. The until...do construct. The command(s) (cmds) are executed until
the expression (expr ) becomes true
While Program 10 splits the sequence as long as there are still nucleotides,
Program 13 splits the sequence until there are no nucleotides left in variable $1.
Program 13: Until...Done
1 #!/bin/bash
2 # save as triplet-until.sh
3 # splits a sequence into triplets
4 x=0
5 until [ -z "${1:$x:3}" ]; do
6 seq=$seq${1:$x:3}" "
7 x=$(expr $x + 3)
8 done
9 echo "$seq"
Note the different option for test. Here the option is -z (returns 0 if the
string is empty), in Program 10 on page 114 it was -n (returns 0 if the string
is not empty).
8.7.5 for...in...do...done
Another useful loop is the for...do construct. This is also known as for loop.
In contrast to the while loop, the for loop does not depend on the exit status
of an expression but on a list of values. The syntax is:
for variable in list; do
action(s)
done

The part "in list" is optional.
Let us take a look at a very simple example. I even have the impression
that this is the shortest loop script we will write.
Program 14: For Loop
1 #!/bin/bash
2 # save as for1.sh
3 # demonstrates for construct
4 for i in one two three; do
5 echo "$i"
6 done
If the for loop is executed without any list, then the command line parameters
are used successively. Thus, Program 11 on page 115 can be rewritten as
shown in Program 16.
Program 16: For Loop
1 #!/bin/bash
2 # save as for-par.sh
3 # gets command line parameters
4 for i do
5 echo "\$#=$# - \$0=$0 - \$i=$i"
6 done
Program 16 does not make use of the shift command. The price is that the
parameter count $# does not decrease. In other words, all parameters are kept
in their corresponding positional variables, while $i takes their values in turn.
8.7.6 case...in...esac
As the name implies, you can distinguish different cases with case. This is
especially helpful when you want to create small user interfaces to guide the
user through your program. The case construct is initiated with case and
ends with esac (the reverse of case). The syntax is:
case string in
pattern1)
action(s)1
;;
pattern2)
action(s)2
;;
esac
The string is compared against the patterns. If a pattern matches, the cor-
responding action will be executed. Again, things get clearer with an example.
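A sketch of Program 17, reconstructed from the description below (the exact messages are assumptions):

 Program 17: Interactive Backup
1 #!/bin/bash
2 # save as case-cp.sh
3 # interactively copy files to *.bak
4 # optional parameter: file extension
5 for file in *$1; do
6 echo "Backup $file? (y/n/q)"
7 read answer
8 case $answer in
9 y*) cp "$file" "$file.bak";;
10 n*) echo "$file skipped";;
11 q*) exit;;
12 *) echo "$answer is not a valid answer";;
13 esac
14 done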
Program 17 asks you for each file in the current directory whether you
want to create a backup or not. The backup is a copy of the file with the
extension .bak. In order to restrict the file selection, the program accepts a file
extension as command line parameter. Thus, if you call the program with
./case-cp.sh .sh
only script files will be listed. In line 5 the file list is created with a for loop.
Line 6 asks whether the current file shall be backed up. Your answer is stored
in the variable answer. Now the case discrimination starts. If answer matches y*, that is
y plus anything, then line 9 is executed. If answer matches n plus anything,
then line 10 is executed and so on. If answer matches neither y*, n*, nor q*,
then line 12 is executed.
You can also connect several patterns with a logical or (|). Thus, j*|y*) would
match yes and the German ja. This can be helpful if you want to create an
international interface for your program.
8.7.7 select...in...do
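The select construct presents the elements of a list as a numbered menu and reads the user's choice into a variable. The syntax resembles the for loop:

select variable in list; do
action(s)
done

A small example in the spirit of Program 18 (the menu items and messages are assumptions; the break sits in line 7, as described below):

 Program 18: Select Menu
1 #!/bin/bash
2 # save as select.sh
3 # demonstrates the select construct
4 select word in continue quit; do
5 echo "You chose $word"
6 if [ "$word" = "quit" ]; then
7 break
8 fi
9 done
10 echo "bye bye"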
With line 7 of Program 18 you force the select...done loop to break and
continue the script after the done statement. In our case we just print “bye
bye”.
Like the case construct, the select construct is well suited to building a simple
user interface for your scripts. Such user interfaces are often highly welcomed
because they facilitate intuitive usage of your script.
8.8 Debugging
The shell offers you some support for debugging. When you start the shell
with the option -x, it will print each command it is executing. Thereby, you
might identify the faulty line. With the option -v, each line of the script
is printed before execution. Thus, you see the original command without any
command expansion [like $(command)] or parameter substitution.
Program 19: Debugging
1 #!/bin/bash -vx
2 # save as date-vx.sh
3 # demonstrates debugging
4 date
5 lss
6 now=$(date)
7 echo "bye bye"
In the first line of Program 19 we invoke the bash shell with the options -vx.
The commands of the script should be clear. In line 4 we call the command
date, in line 5 the non-existing command lss, in 6 we assign the current date
to the variable now and in line 7 we print bye bye. Now let us take a look at
what we see when we execute the program.
Terminal 51: Debugging
1 $ date-vx.sh
2 #!/bin/bash -vx
3 # save as date-vx.sh
4 # demonstrates debugging
5 date
6 + date
7 Son Mai 25 11:16:44 CEST 2003
8 lss
9 + lss
10 ~/scripts/date-vx.sh: line 5: lss: command not found
11 now=$(date)
12 date
13 ++ date
14 + now=Son Mai 25 11:16:44 CEST 2003
15 echo "bye bye"
16 + echo ’bye bye’
17 bye bye
18 $
8.8.2 trap
Some signals cause shell scripts to terminate. The most common one is the
interrupt signal Ctrl+C, typed while a script is running. Sometimes a shell
script will need to do some cleanup, such as deleting temporary files, before
exiting. The trap command can be used either to ignore signals or to catch
them to perform special processing. For example, to delete all files called *.tmp
before quitting after an interrupt signal was received, use the command line
trap ’rm *.tmp; exit’ 2

The interrupt signal Ctrl+C corresponds to signal 2. If this signal is received,
two commands will be executed: rm *.tmp and exit. You can make a shell
script continue to run after logout by letting it ignore the hang up signal
(signal 1). The command
trap ’ ’ 1
allows shell procedures to continue after a hang up (logout) signal.
Program 21: Trap Signal
1 #!/bin/bash
2 # save as trap.sh
3 # catch exit signal
4 while true; do
5 echo "test"
6 trap ’echo "bye bye"; exit’ 2
7 done
Be careful when you enter Program 21. while always gets the result true.
Thus, we initiate an endless loop. The screen will fill up with the message test.
However, when you press Ctrl+C, line 6 is executed. It traps the termination
signal, prints bye bye and exits the script. You should be very careful when
you use trap. Your trap should always keep an option to terminate the program. However, if
you misspell, for example, exit, then your program will hang.
8.9 Examples
There is no other way to learn than by doing. In order to practise the function
of shell scripts you will find a number of examples in this section. Save these
scripts and execute them. Then start to change things and see how the scripts
behave.
You call the script with the name of the file that you want to analyze:
./dna-test.sh seq.dna
In line 5 of Program 22 there are two tests connected by && (logical and).
This means that lines 5 and 6 are executed only if both tests return true.
Alternatively, one might want to use the logical or, which is “||”. The first
test in line 5 checks if the file content is empty; the second test checks if the
file exists. If one of these two conditions is false, the message in line 6 will
be displayed and the program exits. Otherwise, the grep command in line 9
checks if the file given with the parameter $1 contains characters other than
actg. Depending on the result, grep finishes its job with different exit states. If
grep has found other characters than acgt, then the exit status saved in $? is
0, or else it is 1. If the file does not exist, then the exit status of grep is 2. The
exit status is saved in the variable result in line 10. Then the value of result
is tested in a row of if...then...elif...else...fi as described in Section 8.7.1 on
page 110. In line 15 an alternative way to call test is shown: the command
can be replaced by using brackets [ ].
The following program “beeps” the number of hours. When you run the script
at 4 o’clock, you will hear 4 system beeps. You can call the script from the
cron daemon every hour in order to have a time signal.
Program 23: Time Signal
1 #!/bin/bash
2 # save as Time-Signal.sh
3 # gives a time signal every hour when connected to cron
4 time=$(date +%I)
5 count=0
6 while test $count -lt $time; do
7 echo -e "\a"
8 sleep 1 # sleep for one second
9 count=$[$count+1]
10 done
The heart of Program 23 lies in line 4. “$(date +%I)” returns the cur-
rent hour in 12-hour format (01–12). This value is saved in the variable time
and used in the while loop spanning from lines 6 to 10. The beep signal is
generated by the escape sequence “\a” in line 7 (see Sect. 8.4.1 on page 103).
This script asks the user for each file in the active directory whether it should
be added to an archive or not. Finally, all selected files are archived in a file
called archive.
Program 24: Interactively Archive Files
1 #!/bin/bash
2 # save as archive-pwd-i.sh
3 # interactively archive files with tar
4 array=($(ls))
5 count=0
18. All the files chosen by the user for archiving are provided in the variable
list.
Exercises
The following exercise sounds easier than it is...
8.1. Go through all the programs and examples in this section and play
around. Modify the code and observe the changes.
9
Regular Expressions
I am sure you have more than once used the internet search engine Google. You
have not? Then you probably used another search engine – there are plenty of
them out there in the web space. As with all searches, the problem is finding.
The better you define your search problem, the better your query result will
be. In this chapter you will learn how to query for text patterns. Some defi-
nitions before we start: we are going to use the terms literal, metacharacter,
target string, escape sequence and search pattern. Here comes a definition of
these terms:
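In brief:

literal – a character that is matched as it is, e.g. the G in GTA
metacharacter – a character with a special meaning, e.g. the dot (.)
target string – the text we are searching in
escape sequence – a backslash followed by a metacharacter; the metacharacter is then treated as a literal
search pattern – the combination of literals and metacharacters we query with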
As you will soon learn, much more powerful queries are possible. Regular
expressions are understood by many programs and programming languages, among
them MySQL (a database system), Javascript, Java, PHP and many, many more.
If you want to use grep with regular expressions, you should run it with the
option -E (extended regular expressions). Most Unix/Linux systems have a
built-in alias for this, the command egrep. This means that using egrep is the
same as using grep -E.
It is all too easy to write a regular expression that misses the intended
target string. It really needs some practice to figure out how to get
it right. Before you apply a search pattern to an important task, you should
check it with a small self-made test file. It is important to check if the regular
expression really finds the desired target and that it excludes all non-wanted
targets. This becomes especially important when we start to use regular ex-
pressions for text replacements. Test your regular expression thoroughly. As
you will see, egrep (used in this section) and vi (see Sect. 9.3 on page 139)
are good tools for doing this.
You might ask: what is the content of this file? A protein can be char-
acterized by three important features: its amino acid sequence (i.e. primary
structure), the presence of α-helices and β-sheets as structural building blocks
(secondary structure) and the overall three-dimensional structure (tertiary
structure). With methods like X-ray or NMR spectroscopy one can resolve
the tertiary structure of a protein. This means that to each atom, a position
in space can be assigned. You know that it is the three-dimensional structure
that determines the function of a protein.
All structural information of a protein is saved in a special file format: the
Brookhaven Protein Data Bank File Format. Our example file resembles this
file format. However, the length of fields has been cut down for printing pur-
poses. The first 5 lines of structure.pdb give some background information,
the next 4 lines contain the protein sequence, followed by secondary struc-
ture features (lines 10–15) and the position of the atoms. Generally, the file
content is organized in lines. The content of each line is indicated by the first
word (HEADER, COMPND, SOURCE, AUTHOR,...). Again, please note
that both lines and rows of the original file have been truncated!
9.2 Search Pattern and Examples

In this section you will learn more about different types of regular expressions
and how to apply them. Here, we will work with the grep command that
understands regular expressions, that is egrep. egrep works in the same way
as grep does (see Sect. 6.1.5 on page 68): it searches the input (usually a
file) for a specified text pattern. The input text is treated in lines. Lines that
contain the query pattern are printed to the standard output (the screen).
In contrast to grep, the query pattern used with egrep may contain regular
expressions. Remember: The search pattern must be enclosed in single quotes
(’...’). This prevents the shell from executing substitutions. Our first example
file will be structure.pdb from Section 9.1 on page 128.
A typical way to write the name of an amino acid is the three-letter code.
The amino acids glycine and glutamine are represented by GLY and GLN,
respectively. Thus, they differ by only one character. There are different ways
to query for such differences. The meta characters used in such queries are
called single-character meta characters. These include the following symbols:
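.        matches any single character
[...]    matches exactly one of the characters enclosed in the brackets; ranges like [0-9] are allowed
[^...]   matches exactly one character that is not enclosed in the brackets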
Okay, let us see how we can apply these meta characters in regular expressions.
Let us assume that we want to match all lines containing GLY, GLU
or GLN in structure.pdb. The appropriate command is shown in Terminal 52.
Terminal 52: egrep GL.
1 $ egrep ’GL.’ structure.pdb
2 SEQRES 1 A 162 MET ARG ILE LEU VAL LEU GLY VAL GLY ASN
3 SEQRES 2 A 162 THR ASP GLU ALA ILE GLY VAL ARG ILE VAL
4 SEQRES 3 A 162 GLU GLN ARG TYR ILE LEU PRO ASP TYR VAL
5 SEQRES 4 A 162 ASP GLY GLY THR ALA GLY MET GLU LEU LEU
6 HELIX 1 hel ILE A 18 GLN A 28
7 $ egrep ’GLN’ structure.pdb
8 SEQRES 3 A 162 GLU GLN ARG TYR ILE LEU PRO ASP TYR VAL
9 HELIX 1 hel ILE A 18 GLN A 28
10 $ egrep ’GLY’ structure.pdb
11 SEQRES 1 A 162 MET ARG ILE LEU VAL LEU GLY VAL GLY ASN
12 SEQRES 2 A 162 THR ASP GLU ALA ILE GLY VAL ARG ILE VAL
13 SEQRES 4 A 162 ASP GLY GLY THR ALA GLY MET GLU LEU LEU
14 $
All lines where an amino acid lies between alanine (ALA) and glycine (GLY)
are matched with
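egrep 'ALA.....GLY' structure.pdb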
In this example the five dots represent any five consecutive characters, in-
cluding the spaces. The command matches exactly one line, which is line 7.
Alternatively, you could match the same pattern with
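egrep 'ALA ... GLY' structure.pdb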
This example shows that the spaces are treated as normal characters.
The brackets represent a single character from a selection of characters. This
selection is contained within the brackets. In our file three amino acids start
with GL: glycine (GLY), glutamine (GLN) and glutamic acid (GLU). How
can we match these lines containing either GLU or GLY? Now we employ a
selection.
Terminal 53: Brackets
1 $ egrep ’GL[YU]’ structure.pdb
2 SEQRES 1 A 162 MET ARG ILE LEU VAL LEU GLY VAL GLY ASN
3 SEQRES 2 A 162 THR ASP GLU ALA ILE GLY VAL ARG ILE VAL
4 SEQRES 3 A 162 GLU GLN ARG TYR ILE LEU PRO ASP TYR VAL
5 SEQRES 4 A 162 ASP GLY GLY THR ALA GLY MET GLU LEU LEU
6 $
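With the option -v, which inverts the match, a command like

egrep -v '[0-2T]A' structure.pdb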
would match every line where the pattern “[0-2T]A” does not match.
Up to now we have used only a single-character pattern. Now let us see how
we can match repetitions of a particular pattern.
9.2.2 Quantifiers
The regular expression syntax also provides meta characters which specify the
number of times a particular character should match. Quantifiers do not work
on their own but are combined with the single character-matching patterns
discussed above. Without quantifiers, regular expressions would be rather use-
less. The following list gives you an overview of quantifiers:
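*       the preceding item matches zero or more times
+       the preceding item matches one or more times
?       the preceding item matches zero times or once
{n}     the preceding item matches exactly n times
{n,}    the preceding item matches n or more times
{n,m}   the preceding item matches at least n, but not more than m times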
It is quite easy to get confused with these expressions, especially with the
first three. The most universal quantifier is the star (*). In combination with
the dot (.), as in “.*”, it matches anything from no character at all up to a whole line. As mentioned
above, quantifiers are not used on their own. They always refer to the preced-
ing character or meta character. Let us look at some examples with a new test
file named sequence.dna. This file contains two arbitrary DNA sequences in
the FASTA format. Sequences in FASTA format have a sequence name pre-
ceded by the “>” character. The following line(s) contain the sequence itself.
sequence.dna
1 >seq1 the first test sequence
2 ATGxxxTAAxxATGxxTAAGACGCTAGCTCAGCATCGACTACGATCCT
3 GATAGCTATGTCGATGCTGATGCATGCATGCGGGGGGATTGAAAAAGG
4 CGTGTGTAGCGTAATATATGCTATAGCATTGGCATTA
5
Now let us match lines having a start codon (ATG) followed by some nu-
cleotides and a stop codon (TAA):
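egrep 'ATG.*TAA' sequence.dna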
The next regular expression shows all lines containing two or three, but not
more, consecutive As:
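egrep '[CGT]AAA?[CGT]' sequence.dna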
The question mark indicates that the preceding A might be present once or
not present at all in the matching pattern. The [CGT] at the beginning and
end of the regular expression prevents a row of more than three As from being
recognized. This could also have been achieved with [^A], representing any
character but A:
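egrep '[^A]AAA?[^A]' sequence.dna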
If you want to detect all lines with two or more repeated As use
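egrep 'AA+' sequence.dna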
The plus character indicates that the preceding A should be present at least
one time. By using braces ({ }) you can exactly define the number of desired
repetitions. Thus, to find a repetition of 4 As or more, use
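egrep 'A{4,}' sequence.dna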
If you want to search for braces you have to escape them with a backslash
(see Sect. 9.2.5 on the next page)!
As you can see from these examples, regular expressions are really powerful
pattern-matching tools.
9.2.3 Grouping
Often it is very helpful to group a certain pattern. Then you have to enclose it
in parentheses: (...). In this way you can easily detect repeats of the sequence
AT :
egrep ’AT(AT)+’ sequence.dna
Note that if we left out the first AT, we would also match single occurrences
of AT, though we are looking for repeats, that is, more than one occurrence.
The next example searches in the structure file structure.pdb for repeating
glycines:
egrep ’(GLY ){2,}’ structure.pdb
Note that there is a space character after GLY. This is necessary because the
amino acids are separated by spaces! In order to query for parentheses you
must escape them with a preceding backslash (see Sect. 9.2.5 on the facing
page).
9.2.4 Anchors
Often you need to specify the position at which a particular pattern occurs.
This is often referred to as anchoring the pattern:
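^   matches the beginning of a line
$   matches the end of a line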
Let us come back to the file structure.pdb. Find each line that begins with
the character H. The correct command is
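egrep '^H' structure.pdb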
If you want to match all empty lines plus all lines only containing spaces, you
use
egrep ’^ *$’ filename
With this command you see only empty lines on the screen. More probable is
the situation that you want to print all non-empty lines. That can be achieved
with
egrep ’[^ ]’ filename
Empty lines, or lines containing only space characters, do not match this
regular expression. Again, note that the space character is treated as any
other literal. In regular expressions the space character cannot be used to
separate entries!
Regular expressions will also help us to format the output of the ls command.
How can we list all directories? This task can easily be achieved by piping the
output of “ls -l” to egrep:
ls -l | egrep ’^d’
egrep checks whether the first character of the file attributes is a “d” (see
Sect. 4.2 on page 38). In a similar manner you could list all files that are
readable and writable by all users:
ls -l | egrep ’^.{7}rw’
Here the search pattern requires that the 8th and 9th characters of a line are
"r" and "w", respectively.
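In the same spirit you could, for instance, list all entries that are executable
by their owner; the owner's execute permission is the fourth character of the
mode string (a small sketch, following the same logic as above):
ls -l | egrep '^.{3}x'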
By now, you are probably wondering how you can search for one of the spe-
cial characters (asterisks, periods, slashes and so on). As for the shell, the
answer lies in the use of the escape character, that is the backslash (\). In
order to override the meaning of a special character (in other words, to treat
it as a literal instead of a meta character), we simply put a backslash before
that character. Thus, a backslash followed by any special character is a one-
character regular expression that matches the special character itself. This
combination is called an escape sequence. The special characters are:
\ ^ $ . [ ] | ( ) * + ? { }
9.2.6 Alternation
With alternation you can combine several regular expressions into one: the
pipe character (|) matches either the expression on its left or the expression
on its right. This construction will give you the same result as is shown in
Terminal 52 on page 131. You can also combine more of these statements as
shown in the following example:
egrep 'TAA|TAG|TGA' sequence.dna
There is one point to keep in mind when you combine even more expressions.
The regular expression “one and|or two” is equal to “(one and)|(or two)”
but not equal to “one (and|or) two”! You are usually on the safe side by
using parentheses!
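As an illustration of safe parenthesizing, the following command prints the
header lines of our test file; it assumes that the second sequence in
sequence.dna is named seq2:
egrep '^>(seq1|seq2)' sequence.dna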
9.2.7 Back Referencing
In Sections 9.2.3 on page 134 and 9.2.6 on the facing page you have already
seen the use of parentheses in order to group constructs. However, the con-
structs within parentheses are not only grouped but also memorized in an
internal memory. This means that you can refer to previously found patterns
within your construct. The following meta characters take care of back refer-
encing:
(...)  Groups part of a pattern and memorizes the matching text.
\n     Matches the same text that was matched by the n-th group, with n
       being a digit from 1 to 9.
Sometimes you might wish to use these extended search pattern sets. How-
ever, not all programs understand them. You have to try it out.
9.2.9 Priorities
From algebra you know about the priority of arithmetic operators. The opera-
tors of multiplication/division have a larger priority (are evaluated first) than
the operators of addition/substraction (which are evaluated last). There are
similar rules for regular expressions. The order of priority of operators at the
same parentheses level is [ ] (character classes), followed by *, +, ? (closures),
followed by concatenation, followed by | (alternation) and finally followed by
the newline character. Parentheses have the highest priority. Thus, if you are
in doubt, just use parentheses.
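A small example may illustrate these priorities. The pattern TA+ is read as
T(A+), because quantifiers bind more tightly than concatenation: it matches
a T followed by one or more As. If you want to match one or more repetitions
of the two-character unit TA, you must group it:
egrep '(TA)+' sequence.dna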
egrep offers a number of options that help you to fit the output to your needs:
-c  Print only the number of matching lines.
-n  Precede each matching line with its line number.
-v  Invert the match: print all lines that do not match the pattern.
-i  Ignore the case of the letters in the pattern.
Especially the last option can be very useful. DNA sequences are often
saved in lower- or uppercase. Thus, it is often undesirable to distinguish between
lower- and uppercase characters. You can tell egrep to ignore the case with
the option -i (ignore).
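For example, the following command finds the start codon in both lower- and
uppercase sequences:
egrep -i 'atg' sequence.dna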
Exercises
Hey folks – I do not have to tell you any more that you must practice. In the
previous section you learned some basics of programming. In this chapter you
learned the basics of pattern matching. In the coming sections we are going
to fuse our knowledge. However, that is only fun when you are well prepared!
9.1. Go carefully through all the examples in this section and play around.
Modify the code and observe the changes.
9.2. Find a regular expression that matches a line with exactly three space-
separated fields (words).
9.4. Design a search pattern that matches any decimal number (positive or
negative) surrounded by spaces.
9.5. Find a search pattern that matches a nucleotide sequence which begins
with the start codon ATG and ends with the stop codon TAA. The sequence
should be at least 20 nucleotides long.
9.6. Match all lines that contain the word “hydrogenase” but omit all lines
which contain the word “dehydrogenase”.
9.7. A certain class of introns can be recognized by their sequence. The con-
sensus sequences is “GT...TACTAAC...AG”. The three dots represent an un-
known number of nucleotides. Write a search pattern to match these targets.
9.9. List all files in your home directory that are readable by all users. Do the
same for all files in your home directory and all subdirectories.
10
Sed
Why should I learn how to use a simple non-interactive editor when I know
how to use vi? – you might ask. Indeed, this is a valid question in times when
everybody talks about economics. Is it economic to learn sed? Yes! If you want
to bake a cake “from scratch”, you can either mix the dough quickly by hand,
or set up the Braun super mixer for the same job. sed is the “handy” way. It
is quick and has very low memory requirements. This is an advantage if you
work with large files (several megabytes). sed loads one line into the computer's
memory (RAM), edits the line and prints the output either onto the screen or
into a file; and sed is damn fast. I know of no other editor that can compete
with sed. sed allows for very advanced editing commands. However, sed is
comfortable to use only for small editing tasks. For more advanced tasks one
might prefer to use awk or perl.
What are typical tasks for sed? Text substitutions! The most important thing
one can do with sed is the substitution of defined text patterns with other
text. Here are some examples: removing HTML tags from files, changing the
decimal marker from comma to point, changing words like colour to color
throughout a text document, removing comment lines from shell scripts, or
reformatting text files – and this is only a small selection of what can be done
with sed.
For repeatedly occurring tasks you can write sed scripts or include sed in shell
scripts (see Program 25 on page 125 in Section 8.9.4 on page 125).
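10.2 Getting Started
A session of the following kind fits the description below (a sketch: the exact
wording of test.file.txt is inferred from Terminal 59 further down):
Terminal 57: A First Substitution
1 $ cat > test.file.txt
2 I think most spring flowers bloom white.
3 Is that caused by gene regulation?
4 $ sed 's/most/few/' test.file.txt
5 I think few spring flowers bloom white.
6 Is that caused by gene regulation?
7 $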
In Terminal 57 we first create a short text file using cat (see Sect. 6.1 on
page 64). In line 4 you find our first sed editing command. The command itself
is enclosed in single quotes: s/most/few/. It tells sed to substitute the first
occurrence of the pattern most with the replacement few. The command con-
sists of the following elements:
s Substitute command
/../../ Slashes as delimiter
most Regular expression pattern string
few Replacement string
In the example in Terminal 58 the option -n tells sed not to print any line
from the input file unless it is explicitly stated by the command p.
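A minimal session showing this behaviour could look as follows (a sketch
based on the test file created above):
Terminal 58: The Option -n
1 $ sed -n 's/most/few/p' test.file.txt
2 I think few spring flowers bloom white.
3 $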
Sometimes one likes to apply more than one editing command. That can be
achieved with the option -e.
Terminal 59: Several Editing Commands
1 $ sed -e ’s/most/few/’ -e ’s/few/most/’ -e ’s/white/red/’
2 test.file.txt
3 I think most spring flowers bloom red.
4 Is that caused by gene regulation?
5 $
All commands must be preceded by the option -e. The example in Ter-
minal 59 shows how sed works through the editing commands. After the first
line of text has been read by sed, the first editing command (the one most to
the left) is executed. Then the other editing commands are executed one by
one. This is the reason why the word most remains unchanged: first it is sub-
stituted by few, then few is substituted by most. With more and longer editing
commands that looks quite ugly. You are better off writing the commands into
a file and telling sed with the option -f that the editing commands are in a file.
Terminal 60: Editing Command File
1 $ cat>edit.sed
2 s/most/few/
3 s/few/most/
4 s/white/red/
5 $ sed -f edit.sed test.file.txt
6 I think most spring flowers bloom red.
7 Is that caused by gene regulation?
8 $
In Terminal 60 we create a file named edit.sed and enter the desired com-
mands. Note: Now there are no single quotes or e-options! However, each
command must reside in its own line.
A good way to test small sed editing commands is piping a line of text with
echo to sed.
Terminal 61: echo and sed
1 $ echo "Darwin meets Newton" | sed ’s/meets/never met/’
2 Darwin never met Newton
3 $
The pattern space is a kind of working memory where a single text line (also
called record) is held, while the editing commands are applied. Initially, the
pattern space contains a copy of the first line from the input file. If several
editing commands are given, they will be applied one after another to the
text in the pattern space. This is why we can change back a previous change
as shown in Terminal 59 on page 143 (most → few → most). When all the
instructions have been applied, the current line is moved to the standard
output and the next line from the input file is copied into the pattern space.
Then all editing commands are applied to that line and so forth.
While the pattern space is a memory that contains the current input line, the
hold space is a temporary memory. In fact, you can regard it as the memory
key of a calculator. You can put a copy from the pattern space into the hold
space and recall it later. A group of commands allows you to move text lines
between the pattern and the hold space:
h hold-O Overwrites the hold space with the contents of the pattern
space.
H hold-A Appends a newline character followed by the pattern space
to the hold space.
g get-O Overwrites the pattern space with contents of hold space.
G get-A Appends a newline character followed by the hold space to
the pattern space.
x exchange Swaps the contents of the hold space and the pattern space.
You might like to play around with the hold space once you have gotten
accustomed to sed. In the example in Section 10.6.3 on page 159, the hold
space is used in order to reverse the lines of a file.
10.4 sed Syntax
The basic syntax of a sed call is
sed 'EditCommand' Input.File
This line causes sed to edit the Input.File line by line according to the job
specified by EditCommand. As shown before, if you want to apply many edit
commands sequentially, you have to apply the option -e:
sed -e 'EditCommand1' -e 'EditCommand2' Input.File
Sometimes you wish to save your edit commands in a file. If you do this, sed
can execute the commands from the file (which we would call a script file).
Assume the script file's name is EditCommand.File. Then you call sed with
sed -f EditCommand.File Input.File
The option -f tells sed to read the editing commands from the file stated
next. Instead of reading an input file, sed can also read the standard input.
This can be useful if you want to pipe the output of one command to sed.
We have already seen one example in conjunction with the echo command
in Terminal 61 on page 144. A more applied example with ls is shown in
Terminal 64 on page 150.
10.4.1 Addresses
As I said above, sed applies its editing commands to each line of the input
file. With addresses you can restrict the range to which editing commands are
applied. The complete syntax for sed becomes
sed 'AddressEditCommand' Input.File
Note that there is no space between the address and the editing command.
There are several ways to describe an address or even a range:
n                    The line with the number n.
$                    The last line of the input.
/regexp/             All lines matching the regular expression regexp.
n,m                  All lines from line n to line m.
/regexp1/,/regexp2/  All lines from a match of regexp1 to the next match
                     of regexp2.
Let us take a look at some examples. We do not edit anything, but just
print the text lines to which the addresses match. Therefore, we run sed with
the option -n. sed’s command to print a line of text onto the screen is p (see
Sect. 10.5.6 on page 155). As an example file let us use the protein structure
file structure.pdb from Section 9.1 on page 128. In order to display the whole
file content we use
sed -n ’p’ structure.pdb
Do you recognize that this command resembles “cat structure.pdb”? What
would happen if we omitted the option -n? Then all lines would be printed
twice! Without option -n sed prints every line by default and additionally
prints all lines it is told to by the print command p – ergo, each line is printed
twice.
Terminal 62: Addresses
1 $ sed -n ’1p’ structure.pdb
2 HEADER Hydrogenase 23-Mar-99 1CFZ
3 $ sed -n ’2p’ structure.pdb
4 COMPND Hydrogenase Maturating Endopeptidase Hybd From
Addresses can also be regular expressions. For example, the command
sed '/^ATOM/s/.*/---deleted---/' structure.pdb
would select all lines starting with ATOM (/^ATOM/) for editing. The editing
instruction (s/.*/---deleted---/) selects the whole content of the selected lines
(.*) and substitutes it with the text "---deleted---". Try it out!
Note that, as in egrep, parentheses are used for grouping and back referencing
(see Sect. 9.2.7 on page 137), whereas braces belong to the quantifiers (see
Sect. 9.2.2 on page 133).
10.5 Commands
Up to now, we have already encountered some vital sed commands. We came
across substitutions with ’s/.../.../’ and know how to explicitly print lines
by the combination of the option -n and the command p. We also learned
that the commands g, G, h, H and x manipulate the pattern space and hold
space. Let us now take a tour through the most important sed commands
and their application. For some of the following examples we are going to use
a new text file, called GeneList.txt.
GeneList.txt
Energy metabolism
   Glycolysis
     slr0884: glyceraldehyde 3-phosphate dehydrogenase (gap1)
     Init: 1147034 Term: 1148098 Length (aa): 354
     slr0952: fructose-1,6-bisphosphatase (fbpII)
     Init: 2022028 Term: 2023071 Length (aa): 347
Photosynthesis and respiration
   CO2 fixation
     slr0009: ribulose bisphosphate carboxylase large (rbcL)
     Init: 2478414 Term: 2479826 Length (aa): 470
     slr0012: ribulose bisphosphate carboxylase small (rbcS)
     Init: 2480477 Term: 2480818 Length (aa): 113
   Photosystem I
     slr0737: photosystem I subunit II (psaD)
     Init: 126639 Term: 127064 Length (aa): 141
     ssr0390: photosystem I subunit X (psaK)
     Init: 156391 Term: 156651 Length (aa): 86
     ssr2831: photosystem I subunit IV (psaE)
     Init: 1982049 Term: 1982273 Length (aa): 74
   Soluble electron carriers
     sll0199: plastocyanin (petE)
     Init: 2526207 Term: 2525827 Length (aa): 126
     sll0248: flavodoxin (isiB)
     Init: 1517171 Term: 1516659 Length (aa): 170
     sll1796: cytochrome c553 (petJ)
     Init: 846328 Term: 845966 Length (aa): 120
     ssl0020: ferredoxin I (petF)
     Init: 2485183 Term: 2484890 Length (aa): 97
10.5.1 Substitutions
Substitutions are by far the most common, and probably most useful, editing
actions sed is used for. An example of a simple substitution is the transfor-
mation of all decimal markers from commas to points. The basic syntax of
the substitution command is
s/pattern/replacement/flags
The most important flags are:
g       Substitute all matches in the line, not only the first one.
p       Print the line if a substitution was made.
n       Substitute only the n-th match (n being a number).
w file  Write the line to file if a substitution was made.
The flags can be used in combinations. For example, the flag gp would
make the substitution globally on the line and print the line.
The replacement can make use of back referencing (see Sect. 9.2.7 on page 137).
Furthermore, the ampersand (&) stands for (and is replaced by) the string that
matched the complete regular expression.
Terminal 63: Back References
1 $ sed -n 's/^   [A-Z].*$/>>>&/p' GeneList.txt
2 >>>   Glycolysis
3 >>>   CO2 fixation
4 >>>   Photosystem I
5 >>>   Soluble electron carriers
6 $ sed -n 's/^   \([A-Z].*$\)/--- \1 ---/p' GeneList.txt
7 --- Glycolysis ---
8 --- CO2 fixation ---
9 --- Photosystem I ---
10 --- Soluble electron carriers ---
11 $
Now let us assume we want to delete the text parts describing the position
of genes on the chromosome. The correct command would be
sed 's/Init: [0-9]* Term: [0-9]* //' GeneList.txt
Here, we substitute the target string with an empty string. Try it out in order
to see the result.
The next example is a more practical one. In order to see the effect of
the sed command in Terminal 64 you must have subdirectories in your home
directory.
Terminal 64: Pipe to Sed
1 $ ls -l ~ | sed ’s/^d/DIR: /’ | sed ’s/^[^Dt]/ /’
2 total 37472
3 rw-rw-r-- 1 Freddy Freddy 30 May 5 16:29 Datum2
4 rw-rw-r-- 1 Freddy Freddy 57 May 11 19:00 amino2s
5 DIR: rwxrwxr-x 3 Freddy Freddy 4096 May 1 20:04 blast
6 DIR: rwxrwxr-x 3 Freddy Freddy 4096 May 1 20:34 clustal
7 rw-rw-r-- 1 Freddy Freddy 102400 Apr 19 15:55 dat.tar
8 $
Terminal 64 shows how to use sed in order to format the output of the
ls command. By applying the editing commands given in line 1, the presence
of directories immediately jumps to the eye. With "ls -l ~" we display the
content of the home directory [remember that the tilde character (~) is the
shell shortcut for the path of your home directory]. In the first sed editing
command (s/^d/DIR: /) we substitute all occurrences of d at the beginning
of a line (^) with “DIR: ”. The result is piped to a second instance of sed
which introduces spaces at the beginning of each line not starting with a “D”
(as in “DIR”) or “t” (as in “total...”). With our knowledge about writing
shell scripts and the alias function (see Sect. 7.8 on page 89), we could now
create a new standard output for the ls command.
Program 26: Format ls output
1 $ cat list-dir.sh
2 #!/bin/bash
3 # save as list-dir.sh
4 # reformats the output of ls -l
5 echo "`ls -l $1 | sed 's/^d/DIR: /' | sed 's/^[^Dt]/ /'`"
10.5.2 Transliterations
You have got a nice sequence file from a colleague. It contains important se-
quence data, which you want to process with a special program. The stupid
thing is that the sequences are lowercase RNA sequences. However, you require
uppercase DNA sequences. Hmmm – what you need is sed’s transliteration
command “y”.
Terminal 65: Transformation
1 $ cat>rna.seq
2 >seq-a
3 acgcguauuuagcgcaugcgaauaucgcuauuacg
4 >seq-b
5 uagcgcuauuagcgcgcuagcuaggaucgaucgcg
6 $ sed ’/>/!y/acgu/ACGT/’ rna.seq
7 >seq-a
8 ACGCGTATTTAGCGCATGCGAATATCGCTATTACG
9 >seq-b
10 TAGCGCTATTAGCGCGCTAGCTAGGATCGATCGCG
11 $
The address />/! applies the command to all lines not matching the regular
expression > (the exclamation mark inverts the address); thus, the sequence
names remain untouched. The y command transliterates character by charac-
ter: the first character on the left side (a) is transliterated into the first char-
acter on the right side (A) and so on. It always affects the whole line.
Now let us generate the complement of the sequences in the file rna.seq. The
correct command is
sed '/>/!y/acgu/ugca/' rna.seq
Could you achieve the same result with the substitution command from Sec-
tion 10.5.1 on page 149? If you think so and found a solution, please email it
to me.
10.5.3 Deletions
We have used already the substitution command to delete some text. Now
we will delete a whole line. Assume you want to keep only the classes and
subclasses but not the gene information in the GeneList.txt file. This means
we need to delete these lines. The corresponding command is d (delete), the
syntax is
addressd
The d immediately follows the address. The address can be specified as de-
scribed in Section 10.4.1 on page 146. If no address is supplied, all lines will
be deleted – that does not sound too clever, does it? The important thing to
remember is that the whole line matching the address is deleted, not just the
match itself. Let us come back to our task: extracting the classes and sub-
classes of GeneList.txt.
Terminal 66: Deletions
1 $ sed '/     /d' GeneList.txt
2 Energy metabolism
3    Glycolysis
4 Photosynthesis and respiration
5    CO2 fixation
6    Photosystem I
7    Soluble electron carriers
8 $
Since all gene information lines are indented by 5 space characters, these
5 spaces form a nice address pattern. Thus, the task is solved easily, as shown
in Terminal 66.
10.5.4 Appending, Inserting and Changing Lines
With the commands a (append), i (insert) and c (change) you can add new
text after or before matching lines, or replace matching lines entirely.
Append and insert can only deal with a single line address. However,
for change, the address can define a range. In that case, the whole range is
changed. In contrast to all other commands we have encountered up to now,
a, i and c are multiple-line commands. What does this mean? Well, they
require a line break. For example, the syntax for the append command a is
addressa\
text
As you can see, the command is followed by a backslash without any space
character between the address and the command! That is true for all three
commands. Then follows the text that is to be appended after all lines match-
ing the address. To insert multiple lines of text, each successive line must end
with a backslash, except the very last text line.
Terminal 67: Insertion
1 $ sed ’/>/i\
2 > ---------’ rna.seq
3 ---------
4 >seq-a
5 acgcguauuuagcgcaugcgaauaucgcuauuacg
6 ---------
7 >seq-b
8 uagcgcuauuagcgcgcuagcuaggaucgaucgcg
9 $
In the next example we combine deletion with appending text:
Terminal 68: Append
1 $ sed '/     /d
2 > /^[A-Z]/a\
3 > =========================
4 > /   /a\
5 > -------------------------
6 > ’ GeneList.txt
7 Energy metabolism
8 =========================
9    Glycolysis
10 -------------------------
11 Photosynthesis and respiration
12 =========================
13    CO2 fixation
14 -------------------------
15    Photosystem I
16 -------------------------
17    Soluble electron carriers
18 -------------------------
19 $
Terminal 68 gives you an idea of how several commands can be used with
one call of sed. Line 1 contains the editing command that deletes all lines con-
taining five successive space characters. This matches the gene information
lines of our example file GeneList.txt. The address in line 2 matches all main
classes. A line containing equal characters (=) is appended to this match.
Then, to all lines containing three consecutive space characters, a line with
dashes is appended. The editing action is executed after hitting Enter in line
6. What follows in Terminal 68 is the resulting output.
Let us finally take a quick look at the change command.
Terminal 69: Change
1 $ sed ’/Photo/,$c\
2 > stuff deleted’ GeneList.txt
3 Energy metabolism
4    Glycolysis
5      slr0884: glyceraldehyde 3-phosphate dehydrogenase (gap1)
6      Init: 1147034 Term: 1148098 Length (aa): 354
7      slr0952: fructose-1,6-bisphosphatase (fbpII)
8      Init: 2022028 Term: 2023071 Length (aa): 347
9 stuff deleted
10 $
In the example shown in Terminal 69 all lines from the first occurrence
of "Photo" down to the last line ($) are changed to (that is, replaced by) the
text "stuff deleted".
10.5.5 Script Files
If you need to apply a number of editing commands several times, you might
prefer to save them in a file.
The script file containing the commands used in Terminal 68 on the facing
page is shown in Program 27. Lines 1 and 2 contain remarks that are ignored
by sed. You would call this script by
sed -f script1.sed GeneList.txt
Maybe you are using the script very, very often. Then it makes sense to create
an executable script.
Program 28: Sed Executable
1 sed ’
2 # save as script2.sed
3 # Provide filename at the command line
4 /     /d
5 /^[A-Z]/a\
6 =========================
7 /   /a\
8 -------------------------
9 ' $*
After making the file executable (chmod u+x script2.sed), you can run it with
./script2.sed GeneList.txt
This requires much less typing, especially if you create an alias for the program
(see Sect. 7.8 on page 89)!
10.5.6 Printing
We have already used the printing command p a couple of times. Still, I will
shortly mention it here. Unless the default output of sed is suppressed (-n),
the print command will cause duplicate copies of the line to be printed. The
print command can be very useful for debugging purposes.
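A debugging session could look as follows (a sketch, again using test.file.txt):
Terminal 70: Debugging
1 $ sed -n 'p
2 > s/most/few/p' test.file.txt
3 I think most spring flowers bloom white.
4 I think few spring flowers bloom white.
5 Is that caused by gene regulation?
6 Is that caused by gene regulation?
7 $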
Terminal 70 shows how to use p for debugging. In line 1 the current line
in the pattern space is printed. Line 2 modifies the pattern space and prints
it out again. Thus, we see the line before and after editing and might detect
editing errors.
A special case of printing is provided by the equal sign (=). It prints the line
number of the matching line.
Terminal 71: Print Line
1 $ sed -n ’/^[A-Z]/=’ GeneList.txt
2 1
3 7
4 $
The editing command in Terminal 71 prints the line number of lines con-
taining major classes in GeneList.txt.
The read (r) and write (w) commands allow you to work directly with files.
This can be very comfortable when working with script files. Both commands
need a single argument: the name of the file to read from or write to, respec-
tively. There must be a single-space character between the command and the
filename. At the end of the filename, either a new line must start (in script
files) or the script must end with the single quote (as in Terminal 72).
Terminal 72: Write into File
1 $ sed -n '/^   [A-Z].*/w output.txt' GeneList.txt
2 $ cat output.txt
3    Glycolysis
4    CO2 fixation
5    Photosystem I
6    Soluble electron carriers
7 $
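The read command works the other way around. For example, assuming you
have some text in a file named note.txt (a hypothetical file), the command
sed '$r note.txt' GeneList.txt
appends the content of note.txt after the last line ($) of GeneList.txt.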
10.6 Examples
You should keep up the habit of trying out some example scripts. Do some
modification and follow the changes in the output.
Since we started with trees, why not write a script that prints the directory
content in a tree-like fashion?
Program 29: Directory Tree
1 #!/bin/bash
2 # save as tree.sed
3 # requires path as command line parameter
4 if [ $# -ne 1 ]; then
5 echo "Provide one directory name next time!"
6 echo
7 else
8 find $1 -print 2>/dev/null |
9 sed -e ’s/[^\/]*\//|--- /g’ -e ’s/--- |/ |/g’
10 fi
What does Program 29 on the facing page do? Well, first you should
recognize that it is a shell script. In line 4 we check whether exactly one
command line parameter has been supplied. If this is not the case, the message
“Provide one directory name next time!” is displayed and the program stops.
Note: It is always a good idea to make a script fool-proof and inform the user
about the correct syntax when he applies the wrong syntax. Otherwise, the
find command is invoked with the directory name provided at the command
line ($1 ). The find option -print prints the full filename, followed by a new
line, on the standard output. Error messages are redirected into the nirvana
(2>/dev/null, see Section 7.5 on page 85). The standard output, however, is
not printed onto the screen but redirected (|) to sed. In order to understand
what sed is doing, I recommend you to look at the output of find and apply
the search pattern "[^\/]*\/" to it. How? Use vi! In Section 6.3.5 on page 75
we saw that we can read external data into an open editing file. If you type
:r !find . in vim, the output of the find command is read into your editing
buffer, and you can experiment with the search pattern right there.
10.6.3 Reverse Line Order
Assume you have a file with lists of parameters in different lines. For reasons
I do not know, you want to reverse the order of the file content. Ahh, I re-
member the reason: in order to exercise!
Terminal 75: Reversing Lines
1 $ cat > parameterfile.txt
2 Parameter 1 = 0.233
3 Parameter 2 = 3.899
4 Parameter 3 = 2.230
5 $ sed ’1!G
6 > h
7 > $!d’ parameterfile.txt
8 Parameter 3 = 2.230
9 Parameter 2 = 3.899
10 Parameter 1 = 0.233
11 $
Let us walk through the script. The first line of the input is read into the
pattern space. Because of the address 1!, the command G is skipped for the
first line, so the first command that is executed is h. This tells sed to copy the contents of the pattern space
(the buffer that holds the current line being worked on) to the hold space
(the temporary buffer). Then, the d command is executed, which deletes the
current line from the pattern space. Next, line 2 is read into the pattern space
and the command G is executed. G appends the contents of the hold space
(the previous line) to the pattern space (the current line). The h command
puts the pattern space, now holding the first two lines in reverse order, back
to the hold space for safe keeping. Again, d deletes the line from the pattern
space so that it is not displayed. Finally, for the last line, the same steps are
repeated, except that the content of the pattern space is not deleted (due to
the $! before the d). Thus, the content of the pattern space is printed to the
standard output.
Note that you could also execute the script with
sed '1!G;h;$!d' parameterfile.txt
In this case, the commands are separated by the semicolon instead of the
newline character.
Exercises
Now let us see what you have learned. All exercises are based on the file
structure.pdb shown in Section 9.1 on page 128.
10.5. Print all lines where the HELIX line contains the word ILE.
10.6. Append three stars to the end of lines starting with “H”.
10.8. Create a file with text and blank lines. Then delete all blank lines of
that file.
Part IV
Programming
11
Awk
Awkay, are you ready to learn a real programming language? awk is a text-
editing tool that was developed by Aho, Weinberger and Kernighan in the
late 1970s. Since then, awk has been greatly improved and is now a complete
and powerful programming language. The appealing thing for scientists is
that awk is easy enough to be learned quickly and powerful enough to execute
most relevant data analysis and transformation tasks. Thus, awk need not be
awkward. awk can be used to do calculations as well as to edit text. Things
that cannot be done with sed can be done with awk. Things you cannot do
with awk you can do with perl. Things you cannot do with perl you should
ask a computer scientist to do for you.
While you work through this chapter, you will notice that features similar to
those we saw in the chapter on shell programming (see Chap. 8 on page 97)
reappear: variables, flow control, input-output. This is typical. All program-
ming languages have more or less the same ingredients but different syntax.
On the one hand this is a pain in the neck, on the other hand it is very con-
venient. Because if you have learned one programming language thoroughly,
then you can basically work with all programming languages. You just need
to cope with the different syntax; and syntax is something you can look up in
a book on the particular programming language. What programming really
is about is logics. First, understand the structure of the problem you want
to solve, or the task you want to execute, then transfer this structure into a
program. This is the joy of programming. This said, if you already know how
to program, then focus on the syntax, or else have fun learning the basics of a
programming language in this chapter. Did you recognize the italic words in
the last sentence? There, we just used the control element if-then-else, which
is part of every programming language.
As you might have guessed, there are several different versions of awk avail-
able. Thus, if you encounter any problems with the exercises in this chapter,
the reason might be an old awk version on your system. It is also possible
that not awk but gawk is installed on your system. gawk is the freeware GNU
version of awk. In that case you would have to type gawk instead of awk. You
can download the newest version at rpmseek.com.
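A session matching the description below could look as follows (a sketch; the
enzyme names and Km values are taken from the later listings of enzyme.txt):
Terminal 76: First Steps
1 $ cat > enzyme.txt
2 Enzyme Km
3 Protease 2.5
4 Hydrolase 0.4
5 ATPase 1.2
6 $ awk '$2 < 1 {print $1}' enzyme.txt
7 Hydrolase
8 $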
In Terminal 76 we first generate a new text file called enzyme.txt. This file
contains a table with three enzymes and their Michaelis constant (Km). In line
6 we run an awk statement. The statement itself is enclosed in single quotes.
The input file is given at the end. The statement can be divided into two
parts: a pattern and an action. The action is always enclosed in braces ({ }).
In our example the pattern is “$2 < 1”. You can read this as: “if the value in
the variable $2 is smaller than 1, then do”. The content of variable $2 is the
value of field two. Since the default field separator is the space character, field
2 corresponds to the second column of our table, the Km value. Only in line
4 of Terminal 76 “Hydrolase 0.4” is the Km value smaller than 1. Thus, the
pattern matches this line and the action will be executed. In our example the
action is “{print $1}”. This can be read as: “print the content of variable
$1 ”. The variable $1 contains the value of field 1. In the matching line this
is “Hydrolase”. Therefore, field 1 is printed in line 7 of Terminal 76.
The first example gave you a first insight into the function of awk. After
starting awk, the lines of the input file are read one by one. If the pattern
matches, then the action will be executed. Otherwise, the next line will be
read. In the following section we learn more about awk’s syntax.
11.2 awk's Syntax
The basic syntax of an awk call is
awk 'pattern {action}' Input.File
In addition, awk accepts a number of options:
-F"x" Determines the field separator. The default setting is the space
character. If you want to use another field separator you must use
this option and replace x with your field delimiter. x can consist
of several characters.
-f Tells awk to read the commands from the file given after the -f
option (separated by a space character).
The default field separator is the space character. You can change the field
separator by supplying it immediately after the -F option, enclosed in double
quotes.
Terminal 77: Option -F
1 $ awk -F":" ’/Freddy/ {print $0}’ /etc/passwd
2 Freddy:x:502:502::/home/Freddy:/bin/bash
3 $ awk -F":" ’/Freddy/ {print $1 " uses " $7}’ /etc/passwd
4 Freddy uses /bin/bash
5 $
(Remember that "~" is the shell's shortcut for the path to your home direc-
tory.) Before we use this file for calculations, we perform two modifications:
first, we erase the three last lines; second, we remove the commas that serve
as thousands separators in the numbers.
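A command of the kind described below could look like this (a sketch; the
name genomes.txt for the unedited file is an assumption):
sed -n '/\?/!s/,//gp' genomes.txt > genomes2.txt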
Terminal 79 gives you a little sed update. We edit all lines not containing a
question mark (/\?/! – note that the question mark must be escaped because
it is a meta character) and substitute globally the comma by nothing (s/,//g).
Then, we print only the edited lines (option -n plus print command p). The
result of the sed statement should be redirected to a file named genomes2.txt.
Now we have a nice file, genomes2.txt, and we can start to learn more details
about awk.
11.4 Patterns
As we have already seen, awk statements consist of a pattern with an as-
sociated action. With awk you can use different kinds of patterns. Patterns
control the execution of actions: only if the currently active line, the record,
matches the pattern, will the action or command be executed. If you do not
state a pattern, every line of the input text will be treated by the actions
(the default pattern matches every line). Let us take a closer look at the
available pattern types.
Except for the patterns BEGIN and END, patterns can be combined with the
Boolean operators || (or), && (and) and ! (not).
For clarity, we first print the content of the file enzyme.txt. Then, in line 6
of Terminal 80, the first field ($1 ) of all records (lines) containing the character
“2” is printed. The statement in line 9 inverses the selection: the first field
of all records not containing the character “2” is printed. In our example,
we used the easiest possible kind of regular expressions. Of course, regular
expressions as patterns can be arbitrarily complex.
Sometimes you will need to ask the question if a regular expression matches
a field. Thus, you do not wish to see whether a pattern matches a record but
a specified field of this record. In this case, you would use a pattern-matching
expression. Pattern-matching expressions use the tilde operator (~). In the
following list, the variable $n stands for any field variable such as $1 or $2:
$n ~ /re/   Is true if the regular expression re matches the content of $n.
$n !~ /re/  Is true if the regular expression re does not match $n.
The regular expression must be enclosed in slashes. Let us print all lines
where the first field does not fit the regular expression /ase/.
Terminal 81: Pattern-Matching Expression
1 $ awk ’$1 !~ /ase/’ enzyme.txt
2 Enzyme Km
3 $
Remember that the default action is: print the line. In Terminal 81 we omit
the action and specify only a pattern-matching expression. It reads: if field 1
($1 ) does not match (!∼) the regular expression (/ase/), then the condition
is true. Only if the condition is true will the whole line (record) be printed.
In the next example we check which users have set the bash shell as their
default shell. This information can be read from the system file /etc/passwd.
Terminal 82: Bash Users
1 $ awk -F":" ’$7 ~ /bash/ {print $1}’ /etc/passwd
2 root
3 rpm
4 postgres
5 mysql
6 rw
7 guest
8 Freddy
9 $
The awk statement in line 1 of Terminal 82 checks whether the 7th field
of the /etc/passwd file contains the regular expression “bash”, which must
be enclosed in slashes. For all matching records, the first field, that is the
username, will be printed. Note that the field separator is set to the colon
(:) character. Do you prefer to have the output ordered alphabetically? You
should know how to do this! How about
awk -F":" '$7 ~ /bash/ {print $1}' /etc/passwd | sort
Attention, it is very easy to type “=” instead of “==”! This would not cause
an error message; however, the output of the awk statement would not be what
you want it to be. With "=" you would assign a value to a variable. It is also worthwhile
to note that uppercase characters are lexicographically less than lowercase
characters and number characters are less than alphabetic characters.
numbers < uppercase < lowercase
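A session of the kind described below could look as follows (a sketch; the file
name apple.txt is an assumption):
Terminal 83: Relational Character Expressions
1 $ cat > apple.txt
2 Apple
3 apple
4 A
5 a
6 $ awk '$0 < "a"' apple.txt
7 Apple
8 A
9 $ awk '$0 >= "a"' apple.txt
10 apple
11 a
12 $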
I must admit that the use of the term “apple” in Terminal 83 does not quite
give the impression that we are learning a programming language. Anyway,
we first generate a text file with the words “Apple” and “apple” and the
characters “A” and “a”. In all awk statements we check the relation of the
record ($0 ), that is the complete line, with the character “a” or “A”. We do not
define any action, thus, the default action (print $0) is used. You are welcome
to try out more examples in order to get a feeling for relational character
expressions. You should recognize that, in our example, the lexicographically
lowest word is "A". On the contrary, the highest is the word "apple":
"A" < "Apple" < "a" < "apple"
For comparing numbers, the following relational operators are available ($n
stands for a field variable, v for a value):
$n == v Is true if $n is equal to v.
$n != v Is true if $n is not equal to v.
$n < v Is true if $n is less than v.
$n <= v Is true if $n is less than or equal to v.
$n > v Is true if $n is greater than v.
$n >= v Is true if $n is greater than or equal to v.
We have already used relational number expressions in our very first ex-
ample in Terminal 76 on page 164. The use of relational number expressions
is straightforward. In the above list, we have always used the variable $n. Of
course, the expression is much more flexible. Any numerical value is allowed on
the left side of the relation. In the following example we calculate the length
of the field variable with the function length( ).
Terminal 84: Numerical Relation
1 $ awk ’length($1) > 6’ enzyme.txt
2 Protease 2.5
3 Hydrolase 0.4
4 $
If one side of your relation is a string, awk considers both sides to be strings.
Thus, the following statement will lead to a wrong result.
Terminal 85: Mixing
1 $ awk ’$2 > 2’ enzyme.txt
2 Enzyme Km
3 Protease 2.5
4 $
For example, the statement
awk 'BEGIN{one=1; two=2; print (one two)+3}'
prints the numeric value 15. awk assumes that you want to concatenate the
variables one and two with the statement "one two", leading to the string 12.
Then awk assumes that you want to do a calculation and adds 3 to 12. The
result is 15.
If you need to force a number to be converted to a string, concatenate that
number with the empty string ””. A string (that contains numerical charac-
ters) can be converted to a number by adding 0 to that string: “2.5” converts
to 2.5, "2e2" converts to 200, "2.5abc" converts to 2.5 and "abc2.5" con-
verts to 0. Thus, the solution to the problem in Terminal 85 is:
awk '$2+0 > 2' enzyme.txt
With these short examples I wanted to turn your attention to the problem
of mixing letters and numbers. You should always carefully check your state-
ments! Always!
11.4.6 Ranges
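A pattern range consists of two patterns separated by a comma. A session of
the kind described below could look as follows (a sketch):
Terminal 86: Ranges
1 $ awk '/En/,/Hy/' enzyme.txt
2 Enzyme Km
3 Protease 2.5
4 Hydrolase 0.4
5 $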
In the example in Terminal 86 all records between and including the first
line matching the regular expression “En” and the first line matching the reg-
ular expression “Hy” are printed. Of course, pattern ranges could also contain
relations.
In the example in Terminal 87 we use the file from Section 9.1 on page 128.
Note that we add zero to the value of variable $2. By doing this, we ensure
that we have a numeric value, even though the second field contains a text
string. The example in Terminal 87 shows that the range works like a switch.
Upon the first occurrence of the first pattern (here “$2+0 > 3”), all lines are
treated by the action (here the default action: print record) until the second
pattern (here “$2+0 < 2”) matches or becomes true. As long as the “pattern
switch” is turned on, all lines match. When it becomes switched off, no line
matches, until it becomes turned on again. If the “off switch”, which is the
second pattern, is not found, all records down to the end of the file match.
That happens in Terminal 87 from line 4 on.
If both range patterns are the same, the switch will be turned on and off at
each record. Thus, the statement
awk '/ase/,/ase/' enzyme.txt
behaves like the simple pattern /ase/: every matching record forms a range
of its own.
BEGIN and END are two special patterns in awk. All patterns described so far
match input records in one or the other way. In contrast, BEGIN and END do
not deal with the input file at all. These patterns allow for initialization and
cleanup actions, respectively. Both BEGIN and END must have actions and these
must be enclosed in braces. There is no default action.
The BEGIN pattern, or BEGIN block, is executed before the first line (record)
of the input file is read. Likewise, the END block is executed after the last
record has been read. Both blocks are executed only once.
Terminal 88: BEGIN and END
1 $ awk ’
2 > BEGIN {FS="-"; print "Species in the file:"}
3 > {print "\t" $1}
4 > END {print "Job finished"}
5 > ’ genomes2.txt
6 Species in the file:
7 H. sapiens (human)
8 A. thaliana (plant)
9 S. cerevisiae (yeast)
10 E. coli (bacteria)
11 Job finished
12 $
11.5 Variables
We have learned already in the section about shell scripts how to work with
variables (see Sect. 8.3 on page 100). We saw that a variable stands for some-
thing else. They are a way to store a value at one point in your program and
recall this value later. In awk, we create a variable by assigning a value to
it. This value can be either a text string or a numerical. In contrast to the
shell, variables must not be preceded by a dollar character. Variable names
must start with an alphabetical character and can then contain any character
(including digits). As everything in Linux, variable names are case-sensitive!
11.5.1 Assignment Operators
There are several ways to assign a value to a variable in awk. The most
common commands (assignment operators) are stated below. In the list, x
represents a variable, y a number or text.
x = y   Assigns the value of y to x.
x += y  Is the same as x = x + y.
x -= y  Is the same as x = x - y.
x *= y  Is the same as x = x * y.
x /= y  Is the same as x = x / y.
x %= y  Is the same as x = x % y (remainder of the division).
x ^= y  Is the same as x = x ^ y (x raised to the power of y).
Interesting and commonly used variable manipulation tools are the increment
(++) and decrement (--) operators:
++x  Increments x by 1 and returns the new value.
x++  Returns the current value of x and then increments x by 1.
--x  Decrements x by 1 and returns the new value.
x--  Returns the current value of x and then decrements x by 1.
As described in the list above, the increment (++) and decrement (--)
operators increase or decrease the value of a variable by 1, respectively. In
fact, you could do the same thing with an assignment operator described in
the previous section. Thus, the pre-increment ++x is equivalent to x+=1 and
the post-increment x++ is equivalent to (x+=1)-1. The same holds for --: the
pre-decrement --x is equivalent to x-=1 and the post-decrement x-- is equiv-
alent to (x-=1)-1. The increment and decrement operators are nice, easily
readable shortcuts for the assignment operators. As you might have figured
out, the difference between the pre- and the post-increment (or decrement) is
the return value of the expression. The pre-operator first performs the incre-
ment or decrement and then returns a value, whereas the post-operator first
returns a value and then increments or decrements.
Terminal 90: Increment and Decrement
1 $ awk ’BEGIN{
2 > x=1; y=1; print "x="x, "y="y
3 > print "x++="x++, "++y="++y, "x="x, "y="y}’
4 x=1 y=1
5 x++=1 ++y=2 x=2 y=2
6 $
Terminal 90 shows you how the pre- and post-increment operator performs
in real life. Both increase the value of the variables x and y by 1. However,
the return values of x++ and ++y are different.
Terminal 90 demonstrates also the use of the BEGIN block for small experi-
mental scripts. The complete awk script is written within the BEGIN block.
Thus, we do not need any input file. This is a convenient way to write small
scripts that demonstrate only the function of something without referring to
any input file.
awk comes with a number of predefined variables that can be used and mod-
ified. While some variables provide you with valuable information about the
input file or record, others allow you to adapt the behaviour of awk to your
own needs.
Positional Variables
We have worked already intensively with positional variables. They are the
only variables in awk that are preceded by the dollar character. The positional
variable $0 contains the whole active record, while $1, $2, $3 and so on,
contain the value of the different fields of the record.
There are a number of very helpful variables which help you to define what
structures awk interprets as records and fields. The best way to assign a value
to any of the variables in the following list is to use a BEGIN block.
FS           The input field separator (default: the space character).
OFS          The output field separator (default: one space).
RS           The input record separator (default: the newline character).
ORS          The output record separator (default: the newline character).
FIELDWIDTHS  A space-separated list of column widths, used instead of FS
             to split records into fields of fixed width (gawk only).
Quite commonly used is the variable FS. However, especially in cases where
the output of scientific software is analyzed, the FIELDWIDTHS variable can
be very useful.
Some variables provide you with valuable information about the file you are
just working with:
NR        The number of records read so far.
FNR       The number of records read from the current input file.
NF        The number of fields in the current record.
FILENAME  The name of the current input file.
ARGC      The number of command line arguments.
ARGV      An array holding the command line arguments.
The variable ARGV is in fact an array (see Sect. 11.5.4 on the next page).
Thus, behind the facade of ARGV is not just one but many entries. You
can access these by using indices. ARGV[0] is the first element of the ar-
ray; it always contains the value “awk”. ARGV[1] contains the first com-
mand line parameter, ARGV[2] the second command line parameter and so
on. The variable ARGC returns the number of used ARGV variables. Thus,
ARGV[ARGC-1] is the last command line parameter (minus 1, because the
first element is 0). Note that command line parameters need to be between
the awk script and the input file(s), separated by space characters. The syntax
is:
awk 'script' par1 par2 InputFile(s)
Here, ARGV[0] contains "awk", ARGV[1] contains par1, ARGV[2] contains
par2, and the input file names follow in ARGV[3] and upwards; ARGC equals
the total number of these entries.
Let us see how we can use command line parameters.
Terminal 94: Command Line Arguments
1 $ awk ’BEGIN{item=ARGV[1]; ARGV[1]=""}
2 > $1 ~ item
3 > ’ rotea enzyme.txt
4 Protease 2.5
5 $
awk treats all command line parameters as the names of input files. If we
did not erase ARGV[1], awk would try to open a file named rotea, and you
would receive an error message saying that the file or directory could not
be found. With ARGV[1]="" in line 1 in Terminal 94 we erase the content of
ARGV[1]. Thus, you have to transfer the command line parameters to other
variables in the BEGIN block and then delete the content of all ARGV’s. What
our small script in Terminal 94 does is read one command line parameter and
use it in order to find out whether it is found in the first field of any line of
the input file. If so, the line will be printed out (default action).
Let us summarize what is important to remember: a) the variable ARGV is
an array; the first command line parameter sits in ARGV[1] and so forth.
b) Transfer the content of the ARGVs to new variables inside a BEGIN block.
c) Erase the content of the ARGVs in the BEGIN block. That is it.
Why would we want to use command line parameters? They are helpful when
you write complex awk scripts that are saved as files. Similarly to shell scripts,
you can then use the command line parameters to influence the behaviour of
your script.
Shell Environment
When you write awk scripts you might want to access some of the shell vari-
ables, like the user’s home directory. For this purpose the special variable
ENVIRON exists. In fact, this variable is an associative array. You need not
know exactly what that is. You will learn more about arrays in Section 11.5.4.
For now it is more important to see how we have to use it. Assume you want
to get the path to your home directory from within an awk script.
Terminal 95: Shell Variables
1 $ awk ’BEGIN{print ENVIRON["HOME"]}’
2 /home/Freddy
3 $
The correct command is shown in Terminal 95. The brackets indicate that
the variable is an array. In the same way you can gain access to all environ-
ment variables. Take a look back into Section 8.3 on page 100 for a small list
of shell variables. You will find a complete list in the manual pages for bash:
“man bash”.
A last hint: you cannot change environmental variables from within awk
scripts. This you can only do with shell scripts.
Others
There are more awk built-in variables available that we are not going to touch
here. As always, you are welcome to take a look at the manpages of awk.
11.5.4 Arrays
An array is a variable that can hold many values at once. The individual
elements are addressed via an index, which can be a number or a text string,
and you create an element simply by assigning a value to it:
data[1]=0.4; data[2]="ATG"; data["input"]="some text"
In this example the array name is data. As indices we used 1, 2 and input.
Note that strings must be enclosed in double quotes. Arrays can also have
multiple dimensions like
matrix[2,7]="x"
This is a two-dimensional array. The indices for each dimension must be sep-
arated by commas.
As usual, let us take a look at a small example.
Terminal 96: Arrays
1 $ echo "atgccg" | awk ’BEGIN{codon["atg"]="MET"
2 > codon["ccg"]="VAL"; FIELDWIDTHS="3 3"}
3 > {print codon[$1], codon[$2]}’
4 MET VAL
5 $
With the built-in function split you can divide a string into pieces:
n=split(string,array,separator)
This function divides string at the given field separator and stores the re-
sulting pieces in the array named array. The first piece would be stored in
array[1]. The function returns the number of pieces that have been generated
and saved.
Terminal 97: Splitting a String
1 $ awk ’{pieces=split($2,data,".")}
2 > {print pieces, "pieces have been saved"}
3 > {print "Integer: "data[1]"\tDecimal: "data[2]}
4 > ’ enzyme.txt
5 1 pieces have been saved
6 Integer: Km Decimal:
7 2 pieces have been saved
8 Integer: 2 Decimal: 5
9 2 pieces have been saved
10 Integer: 0 Decimal: 4
11 2 pieces have been saved
12 Integer: 1 Decimal: 2
13 $
There are two ways in which you can refer to an array element. The most
common one is to use the array's name and an index:
array[index]
The second form is a test. The expression
index in array
is true if the array contains an element with the given index.
Scanning an Array
With
for (variable in array) command
you generate a loop through all previously used indices of the array array.
These indices are accessible via the variable variable. The reality looks like
this:
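A session matching the description below could look as follows (a sketch;
since awk does not guarantee any particular order, the order of the output
lines may differ on your system):
Terminal 98: Scanning an Array
1 $ awk '/[0-9]/ {data[$1]=$2}
2 > END{for (x in data) print "Index="x, " Value="data[x]}
3 > ' enzyme.txt
4 Index=Hydrolase Value=0.4
5 Index=Protease Value=2.5
6 Index=ATPase Value=1.2
7 $ cat enzyme.txt
8 Enzyme Km
9 Protease 2.5
10 Hydrolase 0.4
11 ATPase 1.2
12 $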
The script in Terminal 98 saves all values of the first and the second field of
lines with a number in the array data. In the END block we use the for...in...
loop to recall all array elements and print out their value (data[x]), together
with the index (x). In line 7 we print out the file enzyme.txt again. If you
compare the output of the awk script with the file content you will recognize
that the order of the output is strange. There is nothing we can do about this.
The order in which the elements of the array are accessed by awk cannot be
influenced. It depends on the internal processing of the data.
A complete array can be deleted with the command
delete array
where array is the array name. If you want to delete only a certain element
you can use
delete array[index]
Once an array element has been deleted, it is gone forever. There is no way
to recover it.
Sorting an Array
With the command asort (a gawk extension) you can sort the elements of an
array according to their value.
Terminal 99: Sorting
1 $ awk ’/[0-9]/ {data[$1]=$2}
2 END{asort(data)
3 for (x in data) print "Index="x, " Value="data[x]}
4 ’ enzyme.txt
5 Index=1 Value=0.4
6 Index=2 Value=1.2
7 Index=3 Value=2.5
8 $
11.6 Scripts and Executables
Like shell scripts, you can make awk scripts executable. This is very conve-
nient when you use the scripts a lot and want to create an alias.
Program 31: Executable Script
1 #!/bin/awk -f
2 # save as sort1-exe.awk
3 # sort on value
4 /[0-9]/ {data[$1]=$2}
5 END{asort(data)
6 for (x in data) {print "Index="x, " Value="data[x]}
7 }
After making the script executable (chmod u+x sort1-exe.awk), you can run
it with
./sort1-exe.awk enzyme.txt
This is basically all you need to know in order to create awk scripts.
11.7.1 if...else...
If you read up to this point then read on, or else start with Chapter 11 on
page 163. – This is what the if...else construct is doing: it checks if the
first condition is true. If it is true, the following statement is executed, or else
an alternative is executed. This alternative is optional.
{if (condition) {
command-1(s)}
else { # this part is optional
command-2(s)} # this part is optional
}
The “else {command-2(s)}” is optional and can be omitted. Let us take
a look at a little program.
Program 32: if...else
1 # save as if-else.awk
2 # demonstrates if-else structure
3 # use with enzyme.txt
4 {if ($2 < 2) {
5 sum_s+=$2}
6 else {
7 sum_b+=$2}
8 }
9 END{
10 print "Sum of numbers greater than or equal 2: "sum_b
11 print "sum of numbers smaller than 2 : "sum_s
12 }
It simply calculates the sum of the value of all second fields of the file en-
zyme.txt that are smaller than 2 (if ($2 < 2)) and larger than or equal to
2 (else). Remember that “sum b += $2” (where sum b is a variable) is the
same as “sum b = sum b + $2” (see Sect. 11.5.1 on page 174). The output of
the program is shown in the following Terminal.
Terminal 100: if...else
1 $ awk -f if-else.awk enzyme.txt
2 Sum of numbers greater than or equal 2: 2.5
3 sum of numbers smaller than 2 : 1.6
4 $ cat enzyme.txt
5 Enzyme Km
6 Protease 2.5
7 Hydrolase 0.4
8 ATPase 1.2
9 $
Just to remind you of the content of enzyme.txt, it is printed out with the
cat command.
11.7.2 while...
While you read this script, concentrate! – The while statement does some-
thing (concentrating), while a condition is true (reading). In contrast to the
“do...while” construction (see Sect. 11.7.3 on the next page), first the state
of the condition is checked and, if the condition is true, commands are exe-
cuted. The result is a loop that will be repeated until the condition becomes
false. The syntax is:
{while (condition) {
command(s)}
}
It is quite easy to end up in an endless loop with the while command.
Keep this in mind whenever you use it!
Program 33: while
1 # save as while.awk
2 # demonstrates the while command
3 # reverses the order of fields
4 BEGIN{ORS=""}
5 {i=NF}
6 {while (i>0) {
7 print $i"\t"; i--}
8 }
9 {print "\n"}
Program 33 shows you an application which prints out the fields of each
line in reverse order. This means the last field is becoming the first, the sec-
ond last is becoming the second and so forth. Execution and output of the
program is shown in Terminal 101.
Terminal 101: while
1 $ awk -f while.awk enzyme.txt
2 Km Enzyme
3 2.5 Protease
4 0.4 Hydrolase
5 1.2 ATPase
6 $
In line 4 of Program 33 the output record separator ORS is set to an empty
string, so that print will not generate a new line after execution. For each line
of the input file the number of fields (variable NF) is stored in the variable i.
In the while loop, first the last field ($i, with i starting at NF) is printed,
followed by a
tabulator character “\t”. Next, i is decremented by 1 (i--). The while loop
stops when i becomes 0. Then a newline character “\n” is printed and the
next line (record) read from the input file.
11.7.3 do...while...
Do not drink alcohol while you learn how to use control structures. – In
contrast to the while command (see Sect. 11.7.2 on the facing page), first one
or more commands are executed and then it is checked if the condition is true.
Only if the condition is still true is the execution of the commands repeated. The
syntax is:
{do {
command(s)}
while (condition)
}
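As a small sketch, the following one-liner prints the numbers 1 to 3; note that
the commands in the do block are executed once before the condition is
checked for the first time:
awk 'BEGIN{i=1; do {print i; i++} while (i<=3)}'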
11.7.4 for...
For each section, there is an example you should carefully study. – The for
loop is a special form of the while loop. We already saw an, albeit unusual,
example of for in Section 11.5.4 on page 181. There it was used to read out
the complete content of an array.
The normal for loop consists of three parts: initialization, break condition
and loop counter. The syntax is:
{for (initialization; condition; counter) {
command(s)}
}
The first part, the initialization, is executed only once. Here, a counter
variable is set. The condition is checked in every loop. Only when the condi-
tion is true will the commands be executed and the loop counter be updated.
As the loop counter one usually uses the increment (++) or decrement (--)
operator.
The following program prints the quadratic powers of the numbers 1 to 10.
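A version of this program, consistent with the explanation that follows (the
listing number continues the previous programs), could look like this:
Program 34: for
1 # save as for.awk
2 # prints the squares of the numbers 1 to 10
3 BEGIN{
4 ORS=" "
5 for (i=1; i<=10; i++) {
6 print i**2}
7 print "\n"
8 }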
I guess the program almost explains itself. The whole script is executed
in a BEGIN block because we do not need any input file. The output record
separator variable ORS is set to the space character. This prevents the print
command from generating line breaks. "i**2" raises the value of the variable
i to the power of 2 (see Sect. 11.8.2 on page 195). In line 7 we print a newline
character (\n). Execution of the program is shown in Terminal 102.
Terminal 102: for
1 $ awk -f for.awk
2 1 4 9 16 25 36 49 64 81 100
3 $
next – The next command forces awk to immediately stop processing the
current record and load the next record from the input file. The next command
is not only interesting in the context of loops but can be generally used in
awk scripts. With the command nextfile you can even force awk to skip the
current input file and load the first line of the next input file. If there is no
other input file, nextfile works like exit.
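A typical use of next is to skip records you are not interested in. The following
sketch (file.txt is a hypothetical input file) skips all comment lines starting
with # and prints everything else:
awk '/^#/ {next} {print}' file.txt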
exit – The exit command causes awk to stop the execution of the whole awk
script. The syntax is
exit n
The number n is an optional extension. If you use it, it will be the return code
of your awk script. After execution of the exit statement, the script first tries
to execute the END block. However, if no END block is present, the program
stops immediately. If the exit command is executed in the BEGIN block, awk
does not try to jump to the END block but stops execution directly.
11.8 Actions
Until now, we have spent a lot of time on understanding patterns, variables
and control structures. This is indispensable knowledge in order to work with
a programming language like awk. However, the real power of awk unfolds
when we learn about its actions (functions) in this and the following sections.
You have learned already that awk actions are always enclosed in braces.
Everything enclosed in braces is an action; actions are also called functions.
We have already made extensive use of the print function. When we worked
with arrays in Section 11.5.4 on page 181 we used the functions asort and
split. In this section we will deal with many more powerful functions. In
Section 11.8.5 on page 201 you will even learn how we define our own functions.
11.8.1 Printing
The default print command is “print $0”. Thus, if you use only print in
your script, awk will interpret this as “print $0”. The syntax of the print
command is
print item1, item2, ...
The items to be printed out are separated by commas. If you want to print
text, it must be enclosed in double quotes:
print "The enzyme is: " $1
The output of print can also be redirected into a file:
awk '{print $0 > "output.txt"}' enzyme.txt
This command would print all lines of the file enzyme.txt into the file out-
put.txt.
be overwritten, unless you use >>, which appends the output to an existing
file.
Now let us take a look at some special characters. They are used to print
tabulators, start new lines or even let the system bell ring. The following list
shows the most important escape sequences.
\a You will hear an alert (system bell) when you print this
character.
\b This prints the backspace character. The result is that the
printing cursor will be moved backwards.
\n When this character is printed a new line will be started.
\t A horizontal tabulator is printed.
\v A vertical tabulator is printed.
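A command matching the description below could look like this (a sketch;
how the output appears depends on your terminal, because the backspace
characters move the cursor on the screen):
Terminal 103: Backspaces
1 $ awk '{print $1 "\b\b\b" "x"}' enzyme.txt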
The awk statement in line 1 of Terminal 103 prints out the first column of
the file enzyme.txt. After the content of the positional variable $1 is printed,
the cursor is placed three positions back (\b\b\b), then the character “x” is
printed. By this, the original content of the line is overwritten.
The command printf provides more control over the format of the output.
The syntax is:
Here, format is a string, the so-called format string, that specifies how to
output the items. Note that printf does not append a newline character! It
prints only what the format string specifies. Both the output field separator
variable OFS and the output record separator variable ORS are ignored.
The format string consists of two parts: the format specifier and an optional
modifier. Table 11.1 shows a number of available format specifiers and gives
examples of what they do.
Note that you need to give a format specifier for each item you want to
print. That is illustrated in Terminal 104.
Terminal 104: Format Specifier
1 $ awk ’BEGIN{printf("%g\n", 10.63, 5)}’
2 10.63
3 $ awk ’BEGIN{printf("%g%g\n", 10.63, 5)}’
4 10.635
5 $ awk ’BEGIN{printf("%g - %g\n", 10.63, 5)}’
6 10.63 - 5
7 $
In line 1 of Terminal 104 we use the format string “%g\n”. It reads: print
a number in the shortest possible way, followed by a newline character. Re-
member that printf does not start a new line by itself. As you can see from
the output in line 2, the second item is omitted. In line 3 we specify to print
2 numbers. They are printed one after the other. In line 5 we introduce a
separator between the numbers.
With modifiers you can further specify the output format. Modifiers are spec-
ified between the % sign and the following letter in the format specifier. Now
let us take a look at some available modifiers.
The following Terminal shows some examples of how the modifiers are
used in conjunction with the format specifiers for strings (%s) and floating
point numbers (%f).
Terminal 105: Modifiers
1 $ awk ’BEGIN{
2 printf("%s - %f\n", "chlorophyll", 123.456789)}’
3 chlorophyll - 123.456789
4 $ awk ’BEGIN{
5 printf("%6s - %6f\n", "chlorophyll", 123.456789)}’
6 chlorophyll - 123.456789
7 $ awk ’BEGIN{
8 printf("%15s - %15f\n", "chlorophyll", 123.456789)}’
9 chlorophyll - 123.456789
10 $ awk ’BEGIN
11 {printf("%-15s - %-15f\n", "chlorophyll", 123.456789)}’
12 chlorophyll - 123.456789
13 $ awk ’BEGIN
14 {printf("%+15s - %+15f\n", "chlorophyll", 123.456789)}’
15 chlorophyll - +123.456789
16 $ awk ’BEGIN{
17 printf("%.6s - %.6f\n", "chlorophyll", 123.456789)}’
18 chloro - 123.456789
19 $ awk ’BEGIN{
20 printf("%15.6s - %15.6f\n", "chlorophyll", 123.456789)}’
21 chloro - 123.456789
22 $ awk ’BEGIN{
23 printf("%-15.6s - %-15.6f\n", "chlorophyll", 123.456789)}’
24 chloro - 123.456789
25 $
Terminal 105 on the facing page gives you an impression of the function of
the modifiers. As you can see, they are placed between the % character and the
format control letter (here, s and f). All numbers before the period character
(.) are width values, specifying the minimum field width. In
lines 11, 14 and 23 the width value is preceded by another formatting char-
acter. All numbers behind the period character specify the precision of the
output. For details, you should consult the list above.
Finally, awk provides the command sprintf. It works like printf, but instead
of printing it returns the formatted string, which can be saved in a variable.
Terminal 106: Print Into Variable
1 $ awk ’BEGIN{
2 > out=sprintf("%s","Hello World")
3 > print out}’
4 Hello World
5 $
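11.8.2 Calculations

Beside the arithmetic operators (+, -, *, /, % and **), awk provides a number of built-in numerical functions, among them sqrt(x), exp(x), log(x), sin(x), cos(x), int(x), rand() and srand(x). rand() returns a random number between 0 and 1, and srand(x) sets a new seed value x for the random number generator.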
In line 1 of Terminal 107 we set a new seed for the random number generator.
Then the results of some calculations are printed.
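11.8.3 Text Manipulation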
Probably the most important functions are those that manipulate text. The
following functions are of two different types: they either modify the target
string or not. Most of them do not. All functions give one or the other type
of return value: either a numeric value or a text string. As with the functions
for numerical calculations in Section 11.8.2 on the page before, this list of
functions is complete.
Terminal 108 illustrates how to use substr. Here we extract and print 3
characters from the 7th position of the value of the variable target.
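Terminal 108: substr
1 $ awk 'BEGIN{target="atgctagctagctgc"
2 > print substr(target,7,3)
3 > print target}'
4 gct
5 atgctagctagctgc
6 $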
gsub(regexp, replacement, target)
Global Substitution. – Substitutes every match of the regular expression regexp in target with replacement and returns the number of substitutions. A sketch:

Terminal 109: gsub
1 $ awk 'BEGIN{target="atgctagctagctgc"
2 > print gsub(/gct/,">gct<",target)
3 > print target}'
4 3
5 at>gct<a>gct<a>gct<gc
6 $
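sub(regexp, replacement, target)
Substitution. – Works like gsub, but replaces only the first match of regexp in target. A sketch:

Terminal 110: sub
1 $ awk 'BEGIN{target="atgctagctagctgc"
2 > print sub(/gct/,">gct<",target)
3 > print target}'
4 1
5 at>gct<agctagctgc
6 $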
As shown in Terminal 110, sub substitutes only the first occurrence of the
regular expression. Thus, the return value can only be 1 (one substitution) or
0 (regular expression not found in target string).
Terminal 111 gives you an example of the function gensub. Here, we sub-
stitute the 2nd occurrence of the regular expression in target. Note that the
original string is not modified. Instead, the modification of the text in the
variable target is the output of gensub.
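Terminal 111: gensub
1 $ awk 'BEGIN{target="atgctagctagctgc"
2 > print gensub(/gct/,">gct<","2",target)
3 > print target}'
4 atgcta>gct<agctgc
5 atgctagctagctgc
6 $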
index(target, find)
Search. – This function searches within the string target for the occurrence
of the string find. The index (position from the left, given in characters) of
the string find in the string target is returned (The first character in target is
number 1). If find is not present, the return value is 0.
Terminal 112: index
1 $ awk ’BEGIN{target="atgctagctagctgc"
2 > print index(target,"gct")
3 > print target}’
4 3
5 atgctagctagctgc
6 $
Terminal 112 gives you an example of the function index. It simply re-
turns the position of the given string.
length(target)
Length. – Another easy-to-learn command is length. As the name implies,
this function returns the length of the string target, or the length of the posi-
tional variable $0 if target is not supplied.
Terminal 113: length
1 $ awk ’BEGIN{target="atgctagctagctgc"
2 > print length(target)
3 > print target}’
4 15
5 atgctagctagctgc
6 $
split(target, array, regexp)
Split. – Splits the string target at every match of the regular expression regexp, saves the resulting fields in the array array and returns the number of fields. A sketch:

Terminal 115: split
1 $ awk 'BEGIN{target="atgctagctagctgc"
2 > n=split(target,array,/g.t/)
3 > for (i=1; i<=n; i++){printf "%s", array[i]" "}
4 > print ""}'
5 at a a gc
6 $
In Terminal 115 we split the string saved in the variable target at positions
defined by the regular expression “g.t”. The resulting fields are saved in the
array named array, whereas the number of resulting fields is saved in n. The
for loop in line 3 is used to read out all elements of the array.
asort(array, destination)
Array Sorting. – This function returns the number of elements in the array
array. The content of array is copied to the array destination, which then
is sorted by its values (for more information see Sect. 11.5.4 on page 181).
The original indices are lost. Sorting is executed according to the information
given in Section 11.4.3 on page 169 (apple > a > A > Apple).
Terminal 116: asort
1 $ awk ’BEGIN{target="atgctagctagctgc"
2 n=split(target,array,/g.t/); asort(array)
3 for (i=1; i<=n; i++){printf "%s", array[i]" "}
4 print ""}’
5 a a at gc
6 $
The script in Terminal 116 is the same as in Terminal 115 on the preceding
page, but extended by the asort function. Thus, the elements of the array
are sorted according to their value, before they are printed out.
strtonum(target)
String To Number. – This function examines the string target and returns its
numeric value. You will hardly need this function.
Terminal 117: strtonum
1 $ echo "12,3" | awk ’{print strtonum($0)}’
2 12
3 $ echo "12.3" | awk ’{print strtonum($0)}’
4 12.3
5 $ echo "alpha12.3" | awk ’{print strtonum($0)}’
6 0
7 $
tolower(target)
To Lowercase. – Returns a copy of the string target, with all uppercase charac-
ters converted to their corresponding lowercase counterparts. Non-alphabetic
characters are left unchanged. For an example take a look at its counterpart
toupper.
toupper(target)
To Uppercase. – The opposite of the function tolower().
Terminal 118: toupper
1 $ awk ’BEGIN{target="atgctagctagctgc"
2 > print toupper(target)
3 > print target}’
4 ATGCTAGCTAGCTGC
5 atgctagctagctgc
6 $
Well, the function toupper really does what it says. Terminal 118 gives
you an example.
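11.8.4 System Calls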
In the same way that you invoke an awk statement from a shell script, you
can invoke a shell command, or system call, from your awk script. The system
function executes the command given as a string.
Terminal 119: System Call
1 $ awk ’BEGIN{system("pwd")}’
2 /home/Freddy
3 $

11.8.5 Defining Your Own Functions

Besides awk's built-in functions you can declare your own. A function definition consists of the keyword function, a name, a pair of parentheses enclosing optional parameters, and the function body in braces. Terminal 120 is a sketch of a first example; it works on the file enzyme.txt:

Terminal 120: Function
1 $ awk 'function out() {printf "%-5s %-10s\n", $2,$1}
2 {out()}' enzyme.txt
3 Km Enzyme
4 2.5 Protease
5 0.4 Hydrolase
6 1.2 ATPase
7 $
In line 1 of Terminal 120 we define a function called out. The function does
not need any parameters, therefore the parentheses remain empty. The action
of our function is to print out the positional variables $2 and $1 in a nicely
formatted way. In line 2 we call the function, as if it were a normal command,
with out().
Now let us see how we can use parameters.
Terminal 121: Function with Parameters
1 $ awk ’function out(pre) {printf pre"%-5s %-10s\n", $2,$1}
2 {out("-> ")
3 out("# ")}’ enzyme.txt
4 -> Km Enzyme
5 # Km Enzyme
6 -> 2.5 Protease
7 # 2.5 Protease
8 -> 0.4 Hydrolase
9 # 0.4 Hydrolase
10 -> 1.2 ATPase
11 # 1.2 ATPase
12 $
The function in Terminal 121 is almost the same as the function in Termi-
nal 120. However, now we use the parameter pre. The value of pre is assigned
when we call the function with out("->") in line 2. Thus, the value of pre
becomes “->”. In line 3 we call the function with yet another parameter.
In the printf command in line 1 the value of pre is used as a prefix for the
output. The result is seen in lines 4 to 11.
Up to this point we simplified the world a little bit. We have said the function
out() in Terminal 120 does not require any parameter. This is not completely
true: it uses the variables $1 and $2 from the script. In fact, function parame-
ters are local variables. If we define parameters they become local variables. A
local variable is a variable that is local to a function and cannot be accessed or
edited outside of it, no matter whether there is another variable with the same
name in script or not. On the other hand, a global variable can be accessed
and edited from anywhere in the program. Thus, each function is a world for
itself. The following example illustrates this.
Terminal 122: Variables in Functions
1 $ awk ’BEGIN{a="alpha"
2 > print "a:", a
3 > fun(); print "b:", b}
4 > function fun(){print a; b="beta"}’
5 a: alpha
6 alpha
7 b: beta
8 $
The function fun defined in line 4 in Terminal 122 does not declare any
parameters. Thus, there are no local variables. The function does have access
to the variable a to which we assigned a value within the script in line 1. In
the same way, the script has access to the variable b that has been assigned
in the function body in line 4. Note that it makes no difference whether you
define your function at the beginning, in the middle or at the end of your
script. The result will always be the same.
The next two examples demonstrate the difference between local and global
variables in more detail.
Program 35: Variables in Functions
1 # save as fun_1.awk
2 BEGIN{
3 a="alpha"; b="beta"
4 print "a:", a; print "b:", b
5 fun("new")
6 print "a:", a; print "b:", b
7 }
8 function fun(a,b) {
9 print "fun a:", a; print "fun b:", b
10 b="BETA"
11 }
In Program 35 we use the two variables a and b within the script and
within the function fun. However, for the function fun the variables a and
b are declared as parameters and thus are local variables (line 8). Note that
when we call the function in line 5 with the command fun("new"), we deliver
only one parameter (new) to it. Terminal 123 shows what the program does.
Terminal 123: Variables in Functions
1 $ awk -f fun_1.awk
2 a: alpha
3 b: beta
4 fun a: new
5 fun b:
6 a: alpha
7 b: beta
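Because a and b are declared as parameters, the value “new” is visible only inside the function (lines 4 and 5 of Terminal 123), and the global values of a and b remain untouched (lines 6 and 7). A function can also hand a result back to its caller with the return command. The following sketch applies this to a function named square:

Terminal 125: return
1 $ awk 'function square(x,a){y=x**2+23-a; return y}
2 {print $1" --- "$2" --- "square($2,5)}' enzyme.txt
3 Enzyme --- Km --- 18
4 Protease --- 2.5 --- 24.25
5 Hydrolase --- 0.4 --- 18.16
6 ATPase --- 1.2 --- 19.44
7 $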
The return command in line 1 of Terminal 125 causes the function call
(square($2,5)) in line 2 to return the value of y. Alternatively, without
the return command, you have to call the function first and then print the
value of y yourself. Of course, this works only if y is a global variable.
This is shown in Terminal 126.
Terminal 126: No return
1 $ awk ’
2 function square(x,a){y=x**2+23-a}
3 {print $1" --- "$2" --- "square($2,5)y}
4 ’ enzyme.txt
5 Enzyme --- Km --- 18
6 Protease --- 2.5 --- 24.25
7 Hydrolase --- 0.4 --- 18.16
8 ATPase --- 1.2 --- 19.44
9 $
What does our function square do? Well, it simply does some math and
calculates a quadratic expression from the Km values in column
two of the file enzyme.txt. Why is the result for “Km” 18? Because awk
silently converts a text string like “Km” to the number 0 when it is used in
a calculation. Therefore, you should check
whether your variable really contains numbers and no characters.
Terminal 127: Number Check
1 $ awk ’
2 > function square(x,a){
3 > if (x~/[A-Z]/){
4 > return ("\""x"\" no number")}
5 > else{
6 > y=x**2+23-a; return y}
7 > }
8 > {print $1" --- "$2" --- "square($2,5)}
9 > ’ enzyme.txt
10 Enzyme --- Km --- "Km" no number
11 Protease --- 2.5 --- 24.25
12 Hydrolase --- 0.4 --- 18.16
13 ATPase --- 1.2 --- 19.44
14 $

11.9 Input with getline

The getline command gives you more control over the input. It reads one record at a time, either from the current input, from a file, or from the output of a command. getline returns 1 if it could read a record, 0 if it reached the end of the file, and -1 if an error occurred.
Let us assume the following scenario. You have one file containing the names
of enzymes and the value of their catalytic activity. In a second file you have
a list of enzyme names and the class to which they belong. Now you wish to
fuse this information into one single file. How can you accomplish this task?
One way is to use getline! In order to go through this example step by
step, let us first create the desired files. The first file is actually enzyme.txt.
You should still have it. The second file, we name it class.txt, has to be created.
Terminal 128: enzyme.txt and class.txt
1 $ cat enzyme.txt
2 Enzyme Km
3 Protease 2.5
4 Hydrolase 0.4
5 ATPase 1.2
6 $ cat > class.txt
7 Protease = Regulation
8 ATPase = Energy
9 Hydrolase = Macro Molecules
10 Hydrogenase = Energy
11 Phosphatase = Regulation
12 $
Terminal 128 shows the content of the example files we are going to use.
Note that both files use different field separators: in enzyme.txt the fields are
space-delimited, whereas in class.txt the fields are separated by “=”.
Now, let us take a look at the program we are applying to solve the task.
Program 37: File Fusion
1 # save as enzyme-class.awk
2 # assigns a class to an enzyme
3 # requires enzyme.txt and class.txt
4 BEGIN{FS=" = "
5 while ((getline < "class.txt") > 0){
6 class[$1]=$2}
7 FS=" "
8 }
9 {if (class[$1]==""){
10 print $0 "\t\tClass"}
11 else{
12 print $0 "\t\t" class[$1]}
13 }
The program is separated into two main parts: a BEGIN block (lines 4 to 8)
and the main body (lines 9 to 13). In line 4 we assign the string “ = ” to
the field separator variable FS. Then we use a while loop, spanning from lines
5 to 8. This loop cycles as long as the following condition is true:

(getline < "class.txt") > 0

With getline we read one line of the file class.txt. Note that the “less” character (<) does
not mean “less than” here, but is used like a redirection. The complete line is saved
in $0. Then, the lines are split according to the value of the field separator
FS and saved in $1...$n, where n is the number of fields (which is saved in
NF ). As said before, getline returns 1 if it could read a line, 0 if it reached
the end of the file, and -1 if it encountered an error. Thus, what we do in line
5 is to check whether getline read a record. In line 6 we use an array called
class (see Sect. 11.5.4 on page 181). The enzyme name, saved in $1, is used
as the array index, whereas the corresponding class, saved in $2, is used as
corresponding array value. For example, the element class[“ATPase”] has the
value “Energy”. Now we know what the while loop is doing: it reads records
from the file class.txt and saves its fields in the array called class. Finally, in
line 7, we set the field separator back to the space character. Now the main
body (lines 9 to 13) processes each record of enzyme.txt: the class stored in
the array is appended to the record, or the string “Class” if the first field is
not found in the array (as for the header line of enzyme.txt).
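The expected output is sketched in Terminal 129.

Terminal 129: enzyme-class.awk
1 $ awk -f enzyme-class.awk enzyme.txt
2 Enzyme Km		Class
3 Protease 2.5		Regulation
4 Hydrolase 0.4		Macro Molecules
5 ATPase 1.2		Energy
6 $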
In the previous example, the record caught by getline was saved in $0 and
split into the positional field variables $1...$n. Additionally, the variable NF
(number of fields) was set accordingly. However, this might not be appropriate
because under certain circumstances the record read from the file by getline
would overwrite the record from the input file stated in the command line. In
such cases you can save the line read by getline in a variable. The syntax is:

getline Variable < "File"
The record is then saved in Variable and neither $0 nor NF are changed. Of
course, the record is not split into fields either. However, that you could do
with the split function (see Sect. 11.8.3 on page 196).
The following example shows you how to save a record in a variable.
Terminal 130: getline and variables
1 $ awk ’
2 > {getline class < "class.txt"
3 > print class" <> " $0}
4 > ’ enzyme.txt
5 Protease = Regulation <> Enzyme Km
6 ATPase = Energy <> Protease 2.5
7 Hydrolase = Macro Molecules <> Hydrolase 0.4
8 Hydrogenase = Energy <> ATPase 1.2
9 $
The script shown in Terminal 130 reads for each line of enzyme.txt one
line of class.txt and prints out both. The line from class.txt is saved in the
variable class.
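getline can also read the output of a shell command: the syntax "command" | getline reads one record of the command's output into $0, whereas "command" | getline var saves it in the variable var. Terminal 131 is a sketch of this (the actual output depends on your system):

Terminal 131: getline From Commands
1 $ awk 'BEGIN{
2 > "date" | getline
3 > "pwd" | getline pwd
4 > print $0
5 > print pwd}'
6 Fri Aug 1 12:00:00 CEST 2003
7 /home/Freddy
8 $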
Terminal 131 shows the application of getline with and without a vari-
able. In line 2, the output of the shell command date is saved in the positional
variable $0, whereas line 3 assigns the output of the shell command pwd to
the variable pwd.
What happens if the shell command produces a multiple-line output? Then
we need to capture the lines with a while loop.
Terminal 132: getline With Multiple Lines
1 $ awk ’BEGIN{
2 > while (("ls" | getline) > 0){
3 > print $0
4 > }}’
5 class.txt
6 enzyme-class.awk
7 enzyme.txt
8 genomes2.txt
9 structure.pdb
10 $
11.10 Examples
This section contains some examples which are meant to invite you to play around
with awk scripts.
The program file and the structure file are required to be in the same directory.
The for loop in line 5 runs for every field in the current record. The follow-
ing if statement checks whether the current field contains a decimal number
(digits separated by a point). If so, the number is added to the variable sum
and printed in line 10, together with the line number.
awk -f dna2prot.awk
When executed without an input filename, the program accepts input from the
command line. Your input will be translated when you hit Enter. Then you
can enter a new DNA sequence. To abort the program press Ctrl+C. Of
course, you could also pipe a sequence to the program or load a file contain-
ing a DNA sequence.
Program 40: Translate DNA
1 # save as dna2prot.awk
2 # translates DNA to protein
3 BEGIN{c["atg"]="MET"; c["ggc"]="GLY"; c["ctg"]="LEU"}
4 {i=1; p=""}
5 {do {
6 s=substr($0, i, 3)
7 printf ("%s ", s)
8 {if (c[s]==""){p=p"    "} else {p=p c[s]" "}}
9 i=i+3}
10 while (s!="")}
11 {printf("\n%s\n", p)}
Let us shortly formulate the problem: the input will be a DNA sequence,
the output is to be a protein sequence. Thus, we need to chop up the sequence
string into triplets. This we do with the function substr, as shown in line 6
of Program 40. The codon is saved in the variable s. Then we look up if the
value of s is in our translation table (line 8). The translation table is in an
associative array called c. If the codon in s does not exist in the array c (thus,
c[s] is empty), 4 spaces are added to the variable p, or else, the value of c[s]
is added to p. We repeat this process as long as there are codons (s must not
be empty). All the rest is nice printing. Note that we use the “do...while”
construct here. If we started with a check if s is empty, the loop would never
start because it is initially empty!
This file gives more information than we currently need. The first line
specifies the content of the space-separated fields in each line. The first fields
contain the full amino acid name, followed by its one-letter and three-letter
code, the molecular mass excluding one water molecule (which is split off
during the formation of the peptide bond), the side chain composition, its
natural abundance and finally the number of sulfur (S), nitrogen
(N), oxygen (O), carbon (C) and hydrogen (H) atoms of the side chain. Our
program first reads the information of this file and then loads the protein
sequences from the FASTA file. The corresponding file name is given as com-
mand line parameter. The full code is given in the following Program.
Program 41: Protein Atomic Composition
1 # save as aacomp.awk
2 # calculates atomic composition of proteins
3 # usage: awk -f aacomp.awk protein.fasta
4
5 # READ AMINO ACID DATA
6 BEGIN {
7 while (getline < "aa_atoms.txt" > 0) {
8 # read number of S,N,O,C,H:
9 data[$2]=$4" " $7" " $8" " $9" " $10" " $11
10 }
11 }
12 # DECLARE FUNCTION to calculate fraction
13 function f(a,b){
14 if (a != 0 && b != 0){
15 return a/b*100
16 }
17 }
18 # PROGRAM BODY
19 {
20 if ($0 ~ /^>/) {
21 print L, MW, S, N, O, C, H, "-",
22 f(S,t), f(N,t), f(O,t), f(C,t), f(H,t)
23 print $0
24 t=0; L=0; MW=0; S=0; N=0; O=0; C=0; H=0
25 }
26 else {
27 i=1; L=L+length($0); tL=tL+length($0)
28 do {
29 aa=substr($0,i,1)
30 i++
31 # begin main
32 split(data[aa],comp) # comp[1]=mass...
33 MW=MW+comp[1]; S=S+comp[2]; N=N+comp[3]
34 O=O+comp[4]; C=C+comp[5]; H=H+comp[6]
35 tMW=tMW+comp[1]; tS=tS+comp[2]; tN=tN+comp[3]
36 tO=tO+comp[4]; tC=tC+comp[5]; tH=tH+comp[6]
37 T=tS+tN+tO+tC+tH; t=S+N+O+C+H
38 # end main
39 }
40 while (aa != "")
41 }
42 }
43 END {
44 print L, MW, S, N, O, C, H, "-",
45 f(S,t), f(N,t), f(O,t), f(C,t), f(H,t)
46 print "\n>SUMMARY"
47 print tL, tMW, tS, tN, tO, tC, tH, "-",
48 f(tS,T), f(tN,T), f(tO,T), f(tC,T), f(tH,T)
49 }
The BEGIN body spans from lines 6 to 11. Here we read the data file
aa_atoms.txt. Take a look at Section 11.9.1 on page 206 for a detailed ex-
planation of the while construct. The second field of the file aa_atoms.txt,
which is available via the variable $2, contains the single-letter amino acid
code. This we use as index for the array data in line 9. Each array element
consists of a space-delimited string with the following information: molecu-
lar weight and the number of sulfur, nitrogen, oxygen, carbon and hydrogen
atoms of the side chain. Thus, for alanine data would look like:
data["A"] = "71.079 0 0 0 1 3"
In line 13 we declare a function. Remember that it does not make any differ-
ence where in your program you define functions. The function uses the two
local variables a and b. The function basically returns the ratio of a to b as
defined in line 15 of Program 41 on the page before.
The main body of the program spans from line 19 to 42. In line 20 we check
if the current line of the input file starts with the > character. Since that
line contains the identifier, it is simply printed by the print command in line
23. Before that, the data of the preceding protein are printed (lines 21 and
22), i.e. the length (L), molecular weight (MW ), count of atoms (S, N, O,
C, H ), a dash and the fraction of the atom counts. This is accomplished by
calling the function f with the number of atoms (S, N, O, C, H ) and the
total number of side chain atoms of the protein (t ) in line 22. Since there is
no protein preceding the first protein (wow, what an insight) we will get the
dash only when the program passes these lines for the very first time (see line
2 of Terminal 133 on the facing page). For the same reason we check in line
14 whether any parameter handed over to the function equals zero: otherwise
we might end up with a division by zero. In line 24 of Program 41 on the page
before we set all counters to zero. If the current line does not start with the >
character, we end up in the else construct starting in line 26. A local counter
i is set to 1 and the number of amino acids in the current line [length($0)]
is added to the counter variables L and tL. While the value of L reflects the
length of the current protein, tL is counting the total number of amino acids
in the file (this is why tL is not reset in line 24). Now we reach the kernel
of the program. The do...while construct between lines 28 and 40 of Pro-
gram 41 on the preceding page is executed as long as there are amino acids in
the active line. These become fewer and fewer because in line 29 we extract
them one by one with the substr function (see also Sect. 11.8.3 on page 196).
The value of the variable aa is the current amino acid. In line 32 we use the
data array we generated in line 9 and apply the current amino acid (sitting
in aa) as key. The array elements are split with the split function (see also
Sect. 11.8.3 on page 196) into the array comp. The default field delimiter used
by the split function is the space character. This is why we introduced space
characters in line 9. The result of this is that comp[1] contains the molecular
weight, comp[2] the number of sulfur atoms and so forth. These data we add
to the corresponding variables in lines 33 to 36.
Finally, we apply the END construct to print out the data of the last protein
in lines 44 and 45 (which resemble lines 21 and 22) and the cumulative data
in lines 47 and 48 of Program 41 on page 213.
Now let us see how the program performs. As an input file you can use any
file containing protein sequences in FASTA format. The following file, named
proteins.seq, is one possible example:
proteins.seq
1 >seq1
2 MRKLVFSDTERFYRELKTALAQGEEVEVITDYERYSDLPEQLKTIFELHKNKSGMWVNV
3 TGAFIPYSFAATTINYSALYIFGGAGAGAIFGVIVGGPVGAAVGGGIGAIVGTVAVATL
4 GKHHVDIEINANGKLRFKISPSIK
5 >seq2
6 MVAQFSSSTAIAGSDSFDIRNFIDQLEPTRVKNKYICPVCGGHNLSINPNNGKYSCYNE
7 LHRDIREAIKPWTQVLEERKLGSTLSPKPLPIKAKKPATVPKVLDVDPSQLRICLLSGE
8 TTPQPVTPDFVPKSVAIRLSDSGATSQELKEIKEIEYDYGNGRKAHRFSCPCAAAPKGR
9 KTFSVSRIDPITNKVAWKKEGFWPAYRQSEAIAIIKATDGIPVLLAHEGEKCVEASRLE
10 LASITWVGSSSDRDILHSLTQIQHSTGKDFLLAYCVDNDSTGWNKQQRIKEICQQAGVS
The program is executed as shown in Terminal 133. Please note that lines
4 to 5 and 7 to 8 each normally comprise one line of output.
Terminal 133: Output of aacomp.awk
1 $ awk -f aacomp.awk proteins.seq
2 -
3 >seq1
4 142 15179.5 2 38 55 411 794 - 0.153846 2.92308 4.23077
5 31.6154 61.0769
6 >seq2
7 295 32608.2 10 110 141 852 1718 - 0.353232 3.88555 4.98057
8 30.0954 60.6853
9
10 >SUMMARY
11 437 47787.7 12 148 196 1263 2512 - 0.290487 3.58267 4.74461
12 30.5737 60.8085
13 $
As stated above, we first get one line containing only a dash. Then follow
the data for all proteins. Finally, we get the summary for all proteins. This is
as if we would concatenate all proteins to one and perform the calculation. The
order of the output is: protein length in amino acids, molecular weight, number
of sulfur, nitrogen, oxygen, carbon and hydrogen atoms in the side chain, a
dash, and the fraction of sulfur, nitrogen, oxygen, carbon and hydrogen atoms.
of disulfide bridges near the protein surface. This can be catalyzed by a group
of proteins known as thioredoxins. Thioredoxins are small proteins with a
redox-active disulfide bridge present in the characteristic active site amino
acid sequence Trp-Cys-Gly-Pro-Cys. They have a molecular mass of approxi-
mately 12 kDa and are universally distributed in animal, plant and bacterial
cells. As stated above, the target enzymes, which are regulated by thioredoxin,
must have two cysteine residues in close proximity to each other and to the
surface. Comparison of primary structures reveals that there is no cysteine-
containing consensus motif present in most of the thioredoxin-regulated target
enzymes [12].
A couple of years ago we found a correlation between the concentration of re-
duced thioredoxin and the activity of a hydrogenase [18]. In order to identify
potential cysteines that could be targets for thioredoxin, one can nowadays
investigate protein structure data of crystallized hydrogenases. We just need
to search for cysteine sulfur atoms that are no further than 3 Å apart. In Sec-
tion 9.1 on page 128 you have already got an insight into the organization of
files containing structural information. For us important lines look like this:
$1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11 $12
ATOM 5273 SG CYS L 436 7.96 9.76 73.31 0.75 14.37 S
The first field ($1 ) says that the line contains the information about an
atom. The unique atom identity is given in the second field ($2 ). Fields $7
to $9 contain the x-, y-, z-coordinates of the atom, respectively. Fields $12
and $4 tell us that the coordinates belong to a sulfur atom (S) of a cysteine
(CYS).
Okay, with this information in hand, the approach to nail down our problem
should be clear!? We catch all lines where $1 matches ATOM, $4 matches
CYS and $12 matches S, respectively. A decent script to catch these lines is
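awk '$1=="ATOM" && $4=="CYS" && $12=="S"' 1FRV.pdb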
where 1FRV.pdb is the structure file. From the matching lines we extract the
coordinates and save them together with the unique atom identifier $2. This
we could do with an array:
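cys_x[$4$6]=$7; cys_y[$4$6]=$8; cys_z[$4$6]=$9

Here one array per coordinate is used, indexed by the residue name and number (e.g. CYS436), just as in Program 42 below.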
Finally, we calculate the distance in space from all saved sulfur atoms. We
are lucky that the coordinates are given in Å; but which formula should be
applied to calculate the distance? Well, do you remember your math classes?
d = sqrt((x2 − x1)^2 + (y2 − y1)^2 + (z2 − z1)^2)
Now we are missing two things: the program and the structure files. You
should test two hydrogenase structures for the existence of nearby cysteines:
the structure with the identifier 1FRV from the bacterium Desulfovibrio gigas
and the structure with the identifier 1FRF from the bacterium Desulfovibrio
fructosovorans. Both structure files can be downloaded from the Protein Data
Bank at
www.rcsb.org/pdb/cgi/export.cgi/
1FRV.pdb?format=PDB&pdbId=1FRV&compression=None
and
www.rcsb.org/pdb/cgi/export.cgi/
1FRF.pdb?format=PDB&pdbId=1FRF&compression=None
respectively. You should save these files in the same directory as the awk script
itself. Now it is time to take a look at the code in following Program.
Program 42: Cysteine Sulfur Distance
1 # save as distance.awk
2 # searches close cysteine sulfur atoms in a structure
3 # requires a structure file (*.pdb)
4 # usage: awk -f distance.awk structure.pdb
5
6 # MAIN: collect the coordinates of cysteine sulfur atoms
7 $1=="ATOM" && $4=="CYS" && $12=="S" {
8 cys_x[$4$6]=$7
9 cys_y[$4$6]=$8
10 cys_z[$4$6]=$9
11 print $4$6": "$7" "$8" "$9
12 }
13 END{
14 for (key1 in cys_x) {
15 for (key2 in cys_x) {
16 if (key1 != key2) {
17 dx=cys_x[key1]-cys_x[key2]
18 dy=cys_y[key1]-cys_y[key2]
19 dz=cys_z[key1]-cys_z[key2]
20 distance[key1"-"key2]=sqrt(dx^2+dy^2+dz^2)
21 if (distance[key1"-"key2]<3) {
22 i++
23 text=key1"-"key2": "distance[key1"-"key2]
24 candidate[i]=text
25 }
26 }
27 }
28 }
29 print "\nCandidates (contain doublets)..."
30 for (i in candidate) {print candidate[i]}
31 }
Terminal 134: Output of distance.awk
1 $ awk -f distance.awk 1FRF.pdb
...
4 ...
5 CYS546: 25.704 -2.660 83.522
6
Terminal 134 shows the output of Program 42 on the facing page for the
hydrogenase of the bacterium Desulfovibrio fructosovorans (file 1FRF.pdb).
I had to cut the output a little bit. In fact there are 20 more cysteines in
the structure. Lines 8 to 11 in Terminal 134 tell us that the sulfur atoms of
cysteines 259 and 436, and 75 and 546 are less than 3 Å apart. However, we
do not yet know whether these sulfur atoms are close to the protein surface.
To check this, we need a molecular structure viewer like Rasmol. Rasmol
can be downloaded from openrasmol.org. Figure 11.2 shows the structure of
Desulfovibrio fructosovorans hydrogenase with highlighted candidate sulfur
atoms.
By turning the molecule around its axis it becomes clear that only the sul-
fur atoms of cysteines 259 and 436 are at the surface of the protein. Thus, these
are potential targets for thioredoxin. This would be the point to turn from
dry science to wet science, i.e. leave the computer and do the experiment...
Exercises
Do some exercises which are hopefully not awkward. For the exercises, create
two files, one containing the line “1,2,3,4,5” and another containing the line
“one;two;three;four;five”. Name them numbers.txt and words.txt, respectively.
11.1. Reverse the order of the numbers in numbers.txt and separate them
with dashes (-).
11.2. Print the first line of numbers.txt followed by a colon (:), followed by
the first line of words.txt. Do this for all lines in both files and save the result
in number-words.txt. The first line of the output should look like this: “1 :
one”.
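11.3. Write an awk script that prints the numbers from 1 to 50 in the following arrangement: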
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
Use the printf command to format the output.
11.4. In this exercise you are going to use the file genomes2.txt from Termi-
nal 79 on page 167. Add one field saying how many base pairs an average gene
has.
11.5. How can you avoid the display of doublets in Program 42 on page 218?
Does this affect the calculation time?
12
Perl
We have to deal with perl! Why? Because it is a great and, especially among
scientists, widely applied programming language – but not least also for his-
torical reasons: perl was initially developed to integrate features of sed and
awk within the framework provided by the shell. As you learned before, awk
is a programming language with powerful string manipulation commands and
regular expressions that facilitates file formatting and analysis. The stream
editor sed nicely complements awk. In 1986, the system programmer Larry
Wall worked for the US National Security Agency. He was responsible for
building a control and management system with the capability to produce re-
ports for a wide-area network of Unix computers. Unhappy with the available
tools, he invented a new language: perl. Beside integrating sed and awk he
was inspired by his background in linguistics to make perl a “human”-like
language, which enables the expression of ideas in different ways. This feature
makes perl scripts sometimes hard to read because there are too many ways
to do one thing. The first version of the Practical Extraction and Report Lan-
guage, perl, was released as open source in 1987. Since then perl has been
growing and growing and growing – now even providing tools for biologists in
the form of special commands via bioperl (see Sect. 12.13 on page 254).
Remember what I said in Chapter 11 on page 163: you must learn only one programming
language in order to understand all programming languages. The rest is a
question of the right syntax. I can highly recommend you to buy one of these
little pocket-sized Pocket Reference books on perl [17]. An old version is all
you need. It provides a good compilation of all available commands and costs
only around 5 Euro or US$.
You can execute perl commands directly on the command line with

perl -e 'command(s)'

or run them from a script file with

perl scriptfile.pl

In the latter case, the script file must begin with “#!/bin/perl”.
You might have noticed the differences from awk: no option is necessary to
execute a script file, while you need the option -e in order to run a command
line script.
Note that each command in a perl script ends with the semicolon character!
12.3 Variables
In perl, all variables are preceded by the dollar character. The only excep-
tion is the assignment of arrays and hashes (see below). perl discriminates
between three different variable types: scalar variables, arrays and hashes.
Scalar variables are normal variables having either a text string or a number
as values. In Section 11.5.4 on page 181 you learned what arrays are. An array
variable stores a list of scalar variables. Each individual array element has an
index and a value. The following array named x consists of 3 elements with
different values:
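@x = (25, "ACGT", 7.45);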
The first element has the index “0”. If the index were a text string instead of a
number, we would be talking about associative arrays or hashes:
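%x = (A => "Adenine", C => "Cytosine", G => "Guanine");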
While awk did not discriminate between arrays and hashes, perl does, as you
will see soon.
Variables are very flexible in perl. The value of a variable is interpreted in a
certain context. Thus, the value might be interpreted either as a number or a
text.
Terminal 136: Variable Context
1 $ perl -e ’$x=7;
2 > print 1+$x,"\n";
3 > print "1+$x\n"’
4 8
5 1+7
6 $
In Terminal 136 we assign the value “7” to the variable called x. In the
arithmetic calculation in line 2, the value of x is considered to be a number.
This is also called a scalar context. In contrast, in line 3, x is considered to
be a text character because it is enclosed in double quotes. It is important to
keep this in mind.
12.3.1 Scalars
The names of scalar variables always begin with the dollar character ($).
Terminal 137: Scalars
1 $ perl -e ’$x="20"; print "$x\n"’
2 20
3 $
Terminal 137 shows you how to assign a value to, and print the value of,
a scalar variable. Of course, you can do all kinds of arithmetic calculations
(+, -, *, /, **), increment (++) and decrement (--) numeric values, and use
arithmetic operators for the assignment of scalar variables. The following list
gives you an overview of the most common assignment operators for numeric
scalar variables.
$a += x Addition: $a = $a + x
$a -= x Substraction: $a = $a - x
$a *= x Multiplication: $a = $a * x
$a /= x Division: $a = $a / x
$a **= x Exponentiation: $a = $a ** x
$a++ Post-increment: use the current value of $a in the expression in
which $a resides, then increment $a by 1.
++$a Pre-increment: increment $a by 1, then use the new value of $a
in the expression in which $a resides.
$a-- Post-decrement: use the current value of $a in the expression in
which $a resides, then decrement $a by 1.
--$a Pre-decrement: decrement $a by 1, then use the new value of $a
in the expression in which $a resides.
In summary, apart from the dollar character preceding variable names, the
handling of scalar variables is pretty much the same as in shell programming
or awk.
12.3.2 Arrays
Arrays allow you to do almost everything with text strings and are one of the
most frequently used and most powerful parts of perl. The name of an array
variable is preceded with the vortex character (@) when assigned and when
the whole array is recalled but with the dollar character ($), when a single
element is recalled.
Terminal 138: Arrays
1 $ perl -e ’@files=‘ls‘; print "@files"’
2 absolute.pl
3 age.pl
4 circle.pl
5 $ perl -e ’@files=‘ls‘; print "$files[1]"’
6 age.pl
7 $
There are different ways in which an array can be filled with values.
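For example (a sketch of the most common forms):

@a = (1, 2, 3);     # assign a list
$a[3] = 4;          # assign a single element
@b = @a;            # copy another array
@files = `ls`;      # capture the output of a shell command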
Once an array has been assigned, you might want to add or remove array ele-
ments. There are some very comfortable commands available for these tasks.
In the following list, we assume that the array is named array.
shift (@array)
Remove First Element. – Removes and returns the first element of array.
pop (@array)
Remove Last Element. – Removes and returns the last element of array.
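unshift (@array, list)
Add First Element(s). – Inserts list at the beginning of array and returns the new number of elements.

push (@array, list)
Add Last Element(s). – Appends list to the end of array and returns the new number of elements.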
@array=()
Delete Array. – Removes all array elements by assigning an empty list to it.
delete @array[a,b,...]
Delete Elements. – Deletes the specified elements of array. Returns a list of
the deleted elements.
In Terminal 140 you see some simple examples of some of the functions
we have just learned.
Terminal 140: Remove and Add Elements
1 $ perl -e ’@a=(1,2,3); print pop @a,"\t",@a,"\n"’
2 3 12
3 $ perl -e ’@a=(1,2,3); print shift @a,"\t",@a,"\n"’
4 1 23
5 $ perl -e ’@a=(1,2,3); unshift @a,(5,6); print "@a\n"’
6 5 6 1 2 3
7 $ perl -e ’@a=(1,2,3); push @a,(5,6); print "@a\n"’
8 1 2 3 5 6
9 $ perl -e ’@a=(1,2,3);
10 > print "Del: ",delete @a[1,2],"\tRest: @a\n"’
11 Del: 23 Rest: 1
12 $
Reading Arrays
perl provides a very convenient command in order to read out all elements
of an array.
Terminal 141: foreach
1 $ perl -e ’@a=(one,two,three,four);
2 > foreach $e (@a){ print $e}’
3 onetwothreefour
4 $ perl -e ’@a=(one,two,three,four);
5 > foreach $e (@a){ print "$e\n"}’
6 one
7 two
8 three
9 four
10 $
In the example shown in Terminal 141 we assign a list to the array a and
use the function foreach to read out all array elements. foreach creates a
loop which will be cycled until the last element of the array a is reached. The
array elements are saved in the variable e.
Selecting Elements
Of course, you can use the foreach function in order, for example, to search
for certain elements or modify individual elements that fit a search pattern;
but perl has shortcuts for this.
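A sketch of such a selection with grep (the array values are arbitrary):

Terminal 142: grep
1 $ perl -e '@a=(1,22,a,bb,3);
2 > @s=grep(/^.$/,@a);
3 > print "@a\n@s\n"'
4 1 22 a bb 3
5 1 a 3
6 $ perl -e '@a=(1,22,a,bb,3);
7 > @s=grep(/^\d$/,@a);
8 > print "@a\n@s\n"'
9 1 22 a bb 3
10 1 3
11 $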
In Terminal 142 we create an array named a and use the grep function in
line 2, to select all array elements that consist of one single character. There-
fore, we apply the regular expression “^.$”. In line 7 all array elements which
have a single number as value are picked. In both examples the original array,
followed by the selection, are printed. The variable $_ is the default input,
output and pattern-searching space variable (see Sect. 12.3.4 on page 233).
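The map function applies an expression to every element of an array. A sketch:

Terminal 143: map
1 $ perl -e '@a=(1,2,3);
2 > @m=map("<".$_.">",@a);
3 > print "@m\n"'
4 <1> <2> <3>
5 $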
Terminal 143 illustrates the function of the map function. Again, we make
use of the special variable $_. The dots in line 2 are perl's operators for
string concatenation.
reverse @array
Reverse Element Order. – Reverses the order of the array elements. The orig-
inal array remains untouched. The reverse function returns the new array.
Terminal 144: Reverse
1 $ perl -e ’@a=(one,two,three,four);
2 > @r=reverse @a; print @r,"\n"’
3 fourthreetwoone
4 $ perl -e ’@a=(one,two,three,four);
5 > @r=reverse @a; print "@r\n"’
6 four three two one
7 $
Terminal 144 demonstrates the reverse function. Excursion: note the ef-
fect of different quotation settings in the print functions in lines 2 and 5. In
Section 12.6.1 on page 242 we will focus on the print function.
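sort @array
Sort Elements. – Returns a list with the elements of array sorted. The original array remains untouched. By default, the elements are compared as strings, i.e. sorted alphabetically. A sketch:

Terminal 145: sort
1 $ perl -e '@a=(10,2,33,bb,a);
2 > @s=sort @a; print "@s\n"'
3 10 2 33 a bb
4 $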
The example in Terminal 145 illustrates the application of the sort func-
tion. Be aware that sort sorts the array elements by default alphabetically.
Array Information
The following list gives you an overview of some functions in order to obtain
information about an array. In the list we assume that the array is named
array.
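scalar (@array)
Number of Elements. – Returns the number of elements of array.

$#array
Last Index. – Contains the index of the last element of array.

exists $array[i]
Element Exists? – Returns true if the element with index i exists.

A sketch of Terminal 146:

Terminal 146: scalar
1 $ perl -e '@a=(1,2,3);
2 > delete $a[1];
3 > print scalar(@a),"\n"'
4 3
5 $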
The perl script in Terminal 146 makes use of the scalar function in order
to demonstrate the behaviour of an array with deleted elements: the number
of array elements does not change!
12.3.3 Hashes
A hash is an unordered collection of “index => value” pairs. Hashes are also
known as associative arrays, since they define associations between indices
and values. While perl’s arrays can only digest numeric indices, these can be
either numeric or strings in hashes. The indices are sometimes referred to as
keys. You might ask: why does perl discriminate between arrays and hashes,
while awk does not (see Sect. 11.5.4 on page 181)? Well, hashes have higher
memory demands. Thus, perl gives you the opportunity to be economic with
your RAM (random access memory).
There is one other important thing you should know about hashes: they can
be used as databases. Okay, pretty easy databases, but anyway. A hash can
be exported into a file and is then available for other perl programs. We will
talk in more detail about this feature later (see Sect. 12.7 on page 244).
The name of hashes is preceded with the percent character (%) when they
are assigned and with the dollar character ($) when a single element is recalled.
Terminal 147: Hashes
1 $ perl -e ’%h=(A,Adenine,C,Cytosine);print "$h{C}\n"’
2 Cytosine
3 $
Terminal 147 gives a short example of a hash variable. Here, the hash is
created by assigning a list to it. The list is read in duplets. Thus, the first
list element is used as the first index, which points to the second list element,
and so forth. However, there are different methods how to create hashes and
assign values to them.
Creating Hashes
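A sketch of the three forms:

Terminal 148: Create Hashes
1 $ perl -e '$code{A}="Ala"; print "$code{A}\n"'
2 Ala
3 $ perl -e '%code=(A,"Ala",C,"Cys"); print "$code{C}\n"'
4 Cys
5 $ perl -e '%code=(A=>"Ala",C=>"Cys"); print "$code{C}\n"'
6 Cys
7 $ perl -e '%code=(A=>"Ala",C=>"Cys"); print "@code{A,C}\n"'
8 Ala Cys
9 $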
In Terminal 148 you find all three possible ways how to create a hash
variable (here named code). In line 1 a single hash element is created by direct
assignment. In lines 3 and 5 two differently formatted lists are assigned to the
hash variable code. The last method (line 5) is probably the most readable
one. Note that, in contrast to arrays, hash elements are enclosed in braces
and not in brackets. Although not employed in the given examples, it is often
wise to enclose text characters in single quotes, like
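%code = ('A' => 'Ala', 'C' => 'Cys');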
From this example you see that perl is pretty much forgiving for different
formats, also with respect to space characters.
Line 7 of Terminal 148 illustrates how to recall more than one hash element
at once. It is important to precede the hash variable name with the vortex
character (@), because you recall a list.
Hashes are not interpreted when enclosed in double quotes. This affects the
action of the print command.
Terminal 149: Hash in Double Quotes
1 $ perl -e ’@nuc=(a,c,g,t);
2 > %hash=(C,Cys,A,Arg);
3 > print "@nuc\n";
4 > print @nuc,"\n";
5 > print "%hash\n";
6 > print %hash,"\n"’
7 a c g t
8 acgt
9 %hash
10 AArgCCys
11 $
As you can see in the example in Terminal 149, arrays are interpreted both
within double quotes and outside. If an array is double-quoted, the elements
are separated by spaces in the output. Hashes are interpreted only outside of
double quotes. Thus, the output does not look that nice. However, you can
assign the space character to the output field separator variable “$,”. Just
add the line
$,=" ";
before the first print command in Terminal 149. Try it out. $, is a built-in
variable. More of these are described in Section 12.3.4 on page 233.
Deleting Values
You just saw how to create hashes and add values to them. How can you
delete an element?
Terminal 150: Delete Hash Element
1 $ perl -e ’%hash=(C,Cys,A,Arg);
2 > print %hash,"\n";
3 > delete $hash{C};
4 > print %hash,"\n"’
5 AArgCCys
6 AArg
7 $
As with arrays, the delete function removes an element, here from the
hash. This is illustrated in Terminal 150.
Hash Information
Of course, you can also obtain some information about hashes. The first thing
you might want to know is the size of a hash variable. For this purpose you
can use the keys function.
Terminal 151: Size of Hash
1 $ perl -e ’%hash=‘ls -l‘;
2 > print eval(keys %hash),"\n"’
3 7
4 $
The example in Terminal 151 uses the keys function to evaluate the size
of the hash variable called hash. In line 1 we use the system call ls -l in
order to fill the hash (although the resulting hash variable does not make too
much sense). We then call the keys function embedded in the eval (evaluate)
function. If we omitted eval, “keys %hash” would return all indices (keys).
This is shown in the next example.
Terminal 152: Keys and Values
1 $ perl -e ’%hash=(C,Cys,A,Arg);
2 > print keys %hash,"\n";
3 > print values %hash,"\n"’
4 AC
5 ArgCys
6 $
In Terminal 152 we apply the functions keys and values. In a list context
both return a list with all indices and all values, respectively. Remember that
the indices in hashes are also called keys. Of course, you could also save the
lists in an array variable instead of printing to standard output.
The keys function is very useful in order to print out all elements of a hash
variable. The foreach construct in the following example resembles the one
we used for arrays (see Terminal 141 on page 226).
Terminal 153: foreach
1 $ perl -e ’%hash=(C,Cys,A,Arg); $\="-";
2 > foreach (keys %hash){print $hash{$_}}
3 > $\=""; print "\n"’
4 Arg-Cys-
5 $
Note that we do not specify any output variable in the foreach construct
in Terminal 153. In Terminal 141 on page 226 we used the variable $e. If we do
not specify any variable, then the default input/output variable $ is used. In
the example in Terminal 153 we also make use of the output record separator
variable $\. By default, this is set to nothing. This is why we always use the
newline escape sequence (\n) to generate a line break.
Reverse a Hash
In Section 12.3.2 on page 224 we saw how the function reverse can be applied
to reverse the order of an array. If applied to hashes, reverse returns a list of
pairs in which the indices and values are reversed. Thus, the indices are now
values and the values are now indices.
Terminal 154: Reverse Hash
1 $ perl -e ’%hash=(C,Cys,A,Arg);
2 > print %hash,"\n";
3 > %hash=reverse %hash;
4 > print %hash,"\n"’
5 AArgCCys
6 ArgACysC
7 $
The examples in Terminal 154 illustrate the effect of the reverse function
on a hash. The reverse function only works correctly with hashes if all the
values are unique. This is because a hash requires its keys to be unique.
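12.3.4 Built-in Variables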
We have already come across some built-in variables that perl provides for
different purposes. In fact, there are over 50 built-in variables. Only some of
them are listed below.
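$_   The default input, output and pattern-searching space.
$,   The output field separator for the print command.
$\   The output record separator for the print command.
$/   The input record separator; a newline by default.
$0   The name of the perl script file.
@ARGV   The array containing the command line arguments.
%ENV   The hash containing the environment variables.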
Take a look at perl’s manpages for more variables. Those listed above are
the most commonly used ones. Most of the other variables are useful for the
experienced programmer doing fancy things but not for us scientists doing
basic data analysis and data formatting.
12.4 Decisions – Flow Control
12.4.1 if...elsif...else

The syntax of the if structure is the same as in awk (see Sect. 11.7.1 on
page 187), except for the additional elsif statement. The if construct either
executes a command if a condition is true or skips the command(s). With the
help of elsif and else several cases can be distinguished.

if (condition-1) {
command(s)-1;}
elsif (condition-2) {
command(s)-2;}
else {
command(s)-3;}
There are no semicolon characters behind the braces.
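12.4.2 die...warn...

With die you can print a message to the standard error and terminate the script; warn prints the message but lets the script continue. A sketch:

Terminal 155: die and warn
1 $ perl -e 'if (1==1){die "dead: gone\n"}'
2 dead: gone
3 $ perl -e '
4 > if (2>1){warn "beware!\n"}
5 > print "still alive\n"'
6 beware!
7 still alive
8 $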
Examples of die and warn are shown in Terminal 155. In contrast to the
nonsense conditions in line 1 and 4, one would rather test for the existence of
a file or something similar.
12.4.3 while...
The while construct performs an action while a condition is true (see
Sect. 11.7.2 on page 188).
while (condition) {
command(s);}
The while command first checks if the condition is true or false and then
executes the command(s). There are still no semicolon characters behind the
braces.
12.4.4 do...while...
Strongly related to while, the do...while construct first executes com-
mand(s) and then checks for the state of the condition (see Sect. 11.7.3 on
page 189).
do {
command(s);}
while (condition)
There are, as usual, no semicolon characters behind the braces.
12.4.5 until...
The until repetition structure is simply the opposite of while. The body of
until repeats while its condition is false, i.e. until its condition becomes true.
until (condition) {
command(s);}
Note that there are no semicolon characters behind the braces.
12.4.6 do...until...
As with do...while, the do...until statement differs from the until state-
ment by the time when the condition is tested with respect to command
execution.
do {
command(s);}
until (condition)
First, the command(s) are executed, then the state of the condition is tested.
Funny, but there are no semicolon characters behind the braces.
12.4.7 for...
The application of perl’s for construct exactly resembles the one in awk (see
Sect. 11.7.4 on page 189).
for (initialization; condition; counter){
command(s);}
Hm, are there no semicolon characters behind the braces? No!
12.4.8 foreach...
The foreach construct allows you to iterate over a list of values, which is
either provided directly or via an array.
foreach (list-or-array){
command(s);}
There are never, ever semicolon characters behind the braces. The nice thing
with explicitly given lists is that you can state ranges like “1..5” or “A..D”,
representing the lists “1, 2, 3, 4, 5” and “A, B, C, D”, respectively.
Terminal 156: foreach
1 $ perl -e ’foreach $no (1..4){
2 > print $no} print "\n"’
3 1234
4 $ perl -e ’foreach (A..D){
5 > print $_} print "\n"’
6 ABCD
7 $
The two little scripts in Terminal 156 illustrate the use of lists. In the first
example, in each cycle the values 1 to 4 are assigned to the variable no. In the
second example the general purpose variable $_ is used instead.
Some more examples can be found in Terminal 141 on page 226 and Ter-
minal 142 on page 227. Especially the commands grep and map in Termi-
nal 142 on page 227 are worth looking at.
Loops are nice, controlled loops are nicer. The while, until, for and foreach
statements fall into the class of loop structures. Like awk (see Sect. 11.7.5 on
page 190), perl allows you to control the program flow through loops.
next – The next statement skips the remaining commands in the body of
a loop and initiates the next cycle (iteration) of the loop. In the following
program flowchart loop stands for either while, until, for or foreach.
loop (condition){ <-
command(s); |
next; >-
command(s);
}
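A sketch of Terminal 157, which masks all non-nucleotide characters of a sequence given on the command line:

Terminal 157: next
1 $ perl -e '@seq=split(//,$ARGV[0]);
2 > foreach $s (@seq){
3 > unless ($s=~/[acgtACGT]/){
4 > print "-";
5 > next}
6 > print $s}
7 > print "\n"
8 > ' gatcRYXactg
9 gatc---actg
10 $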
The array element ARGV[0] in line 1 of Terminal 157 contains the com-
mand line parameter, here the sequence in line 8. This sequence string is split
into an array (see Sect. 12.3.2 on page 224) and each element is extracted,
one after the other, in the foreach loop. In line 3 we check if the character is
a nucleotide or not. If it is not, a dash is printed in line 4 and the next cycle
starts, or else, the nucleotide is printed. Note that [acgtACGT] is a regular
expression, which must be enclosed by slashes.
last – The last command causes immediate exit from the loop. In the follow-
ing program flowchart loop stands for either while, until, for or foreach.
loop (condition){
command(s);
last; >-
command(s); |
} |
<-
The program execution continues with the first statement after the loop struc-
ture.
redo – After execution of the redo command the loop returns to the first
command in the body of the loop without evaluating the condition. In the fol-
lowing program flowchart loop stands for either while, until, for or foreach.
loop (condition){
command(s); <-|
redo; >-|
command(s);
}
Conditions
In Section 11.4 on page 167 we learned a lot about the comparison of text
strings and numbers with certain patterns. These comparisons yield either
true or false and are commonly used as conditions for program flow-control
structures. The list below is a compilation of the most important ones. If you
can answer the question with yes, the condition is true. Usually, $B is not a
variable but a fixed number or string.
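$A == $B   Is $A numerically equal to $B?
$A != $B   Is $A numerically unequal to $B?
$A > $B   Is $A greater than $B?
$A < $B   Is $A less than $B?
$A >= $B   Is $A greater than or equal to $B?
$A <= $B   Is $A less than or equal to $B?
$A eq $B   Is the string $A equal to the string $B?
$A ne $B   Is the string $A unequal to the string $B?
$A =~ /P/   Does $A match the pattern P?
$A !~ /P/   Does $A not match the pattern P?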
As I said above, there are more operators available. Those described here
are the most commonly used ones.
12.5 Data Input
Well, it is nice to have a refrigerator, but it is bad if you cannot put things into
it or get things out. In a sense, perl scripts and refrigerators have something
in common! In this section you will see different ways which let your program
read and write data.
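12.5.1 Command Line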
The simplest way to feed a program with data, though not the most comfort-
able one, is the command line. One way is to provide data in the command
line when you call the program. We used this technique in Terminal 157 on
page 237. The command line parameters are accessible via the array ARGV.
The variable ARGV[0] is the first element in the command line and so forth.
The elements are separated by space characters.
Another way is to ask the user for some input during the execution of the
script.
Terminal 158: Input
1 $ perl -e ’print "Enter Sequence: "; $DNA=<STDIN>;
2 > print "The sequence was $DNA";
3 > print "Thank you\n"’
4 Enter Sequence: gactagtgc
5 The sequence was gactagtgc
6 Thank you
7 $
Terminal 158 shows you an example of how you can obtain user input. The
important command is “$DNA=<STDIN>”. <STDIN> is called a filehandle. You
will see more of these soon. The filehandle <STDIN> initiates a request. All
characters you have typed are saved in the variable DNA after you hit Enter.
Note that the newline character created by pressing Enter is saved, too. As
we will see later in Section 12.9 on page 247, you can remove it with the
command chomp.
12.5.2 Internal
Your program might require some default input, like a codon table. This you
attach to the end of a script after the “__END__” statement.
Program 43: Internal Data
1 # save as data.pl
2 print <DATA>;
3 __END__
4 Here you can save your data
5 in many lines
12.5.3 Files
Apart from the default filehandles <STDIN> and <DATA> you can define your
own and get input from files. This is what you most probably want to do.
Program 44: Input From A File
1 # save as open-file.pl
2 # input a filename
3 # the content of the file is displayed
4 print "Enter Sequence File Name: ";
5 $file=<STDIN>;
6 unless (open (SEQFILE, $file)){
7 die "File does not exist\n"}
8 @content=<SEQFILE>;
9 close SEQFILE;
10 print @content;
Program 44 shows you how to read data from a file. This program is
even a little bit interactive. Line 4 prints a text asking for a filename. Line
5 stops program execution and requires user input. After you have entered
the filename, press Enter. With the open command the file whose name is saved in the
variable file is opened. SEQFILE is the filehandle for this file. If you want to
refer to the opened file, you have to use its filehandle. It can have any name
you like. By convention, filehandles are written uppercase. We also check with
the unless...die construct, whether the file exists. In line 8, the whole file
content is assigned to the array content. Thus, content[0] will contain the
first line of the file and so forth. Next, we close the file with the command
close. Again, we use the filehandle to tell the system which file to close. This
is obvious here; however, you could have many files open at once. Each file
must have its own filehandle. Finally, the value of the array content, i.e. the
file content, is displayed.
In the previous example we assigned the filehandle to an array. In such cases,
the whole file is copied to the array, each line being one array element. This is
a somewhat special case because usually the input file is read line by line. By
default, lines (or better: records) are expected to be separated by the newline
character. However, you are free to change the record separator by assigning
a value to the input record separator variable $/. A typical file-reading pro-
cedure is shown in the next example.
Terminal 159: Read File By Lines
1 $ perl -e ’open (INPUT, "sequences.txt") or die;
2 > $i=1;
3 > while (<INPUT>){
4 > print "Line-$i: $_";
5 > $i++}’
6 Line-1: >seq11
7 Line-2: accggttggtcc
8 Line-3: >Protein
9 Line-4: CCSTRKSBCJHBHJCBAJDHLCBH
10 $
In Terminal 159 the file sequences.txt is read line by line. The while loop
iterates as long as there are lines to read. <INPUT> becomes false when the
end of the file sequences.txt has been reached. The active line (record) is avail-
able via the special variable $_. Note that each line consists of all its characters
plus the newline character (line break) at the end. Therefore, we do not use
the newline character “\n” in the print command in line 4. At the beginning
of the program we use the operator or in conjunction with open. This is
another way of writing the unless construct shown above. Programmers like
shortcuts!
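Changing the record separator is quite handy for sequence files. The following
sketch, for example, counts the fasta entries in sequences.txt by setting $/ to
the “>” character (it assumes that every entry starts with “>”):

$ perl -e '$/=">"; open (INPUT, "sequences.txt") or die;
> while (<INPUT>){$n++ if $_ ne ">"}
> print "$n fasta entries\n"'
2 fasta entries
$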
Sometimes, you might want to read a file as blocks of characters. This can be
done with the command read.
Terminal 160: read
1 $ perl -e 'open (INPUT, "sequences.txt") or die;
2 > $i=1; while (read (INPUT,$frac,10)){
3 > print "Line-$i: $frac\n";
4 > $i++}’
5 Line-1: >seq11
6 acc
7 Line-2: ggttggtcc
8
9 Line-3: >Protein
10 C
11 Line-4: CSTRKSBCJH
12 Line-5: BHJCBAJDHL
13 Line-6: CBH
14
15 $
The input file we use in Terminal 160 is the same as we used in Ter-
minal 159. However, now we instruct perl to read only 10 characters at a
time from sequences.txt with the command read(INPUT,$frac,10).
The read command requires the filehandle, a variable in which to save
the string and the number of characters to extract. Count the characters in
the output by yourself and keep in mind that the invisible newline character
counts as well.
12.6 Data Output
Okay, the print command is not really new any more; we have used it over
and over in all Terminals in this chapter. Anyway, be aware that the print
command does not automatically append a newline character. This is one im-
portant difference from awk. However, you can modify the default output field
and record separators by changing the values of the variables $, and $\, re-
spectively. Remember also that, in contrast to variables enclosed in single
quotes, those enclosed in double quotes are expanded.
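As a small illustration, the following one-liner sets the output field separator
$, to a tab and the output record separator $\ to a newline (both values are
arbitrary examples):

$ perl -e '$,="\t"; $\="\n"; print "one","two","three"'
one     two     three
$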
Like awk, perl lets you print formatted text strings either onto the screen
(printf) or into a variable (sprintf). Both commands work exactly as de-
scribed in Section 11.8.1 on page 191 for awk. I am sure that you still remember
how they worked, don’t you?
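Just as a small reminder, here are both commands in action (the label and
the number are made up):

printf("%-10s%6.2f\n", "GC:", 12.3456);        # prints the formatted line
$line = sprintf("%-10s%6.2f", "GC:", 12.3456); # saves it in $line instead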
Do you recall here documents from shell programming? These allow you to
display large blocks of text (refresh your memory in Sect. 8.4.2 on page 103).
You initiate a here document with the << operator. The operator is followed
by an arbitrary string (the identifier) – there must not be any space charac-
ter between the operator and the identifier. All following text is regarded as
coming from the standard input, until the identifier appears a second time.
For perl to recognize the closing identifier, it must be unquoted and must
stand alone at the very beginning of a line; no other code may be placed on
this line. Usually, variables within the here document are expanded. However,
if the identifier is single-quoted, variables are not expanded.
Program 45: Here Document
1 # save as heredoc.pl
2 $text=<<TEXT;
3 This is line one
4 this is line 2
5 TEXT
6 print <<'TEXT';
7 It follows the text
8 just saved in $text
9 TEXT
10 print "$text";
Note that the value of the variable text in line 8 of Program 45 is not expanded
because the identifier “TEXT” in line 6 is single-quoted.
12.6.3 Files
Very often one wishes to save some results directly in a file. This is basically
as easy as reading a file.
Program 46: Save To File
1 # save as save-seq.pl
2 # appends a sequence to the file sequences.txt
3 # sequences.txt will be created if it does not exist
4 # sequences are saved in fasta format
5 print "Enter Sequence Name: ";
6 $seqname=<STDIN>;
7 print "Enter Sequence: ";
8 $seq=<STDIN>;
9 open (FILE, ">>sequences.txt");
10 print FILE ">$seqname";
11 print FILE $seq;
12 close FILE;
Program 46 saves a sequence together with its name in a file called se-
quences.txt. The sequence name and the sequence itself are read from stan-
dard input in lines 6 and 8. In line 9 we open the file sequences.txt in
the append mode (>>) and assign the filehandle FILE to it. Now, we can
use the filehandle in conjunction with the print command in order to save
data in the file sequences.txt. If this file does not exist, it will be created. If it
exists, data will be appended to the end. Finally, in line 12, we close the file.
As you could see in the previous example, files can be opened in different
modes. The required mode is prefixed to the filename:

<    open for reading (the default)
>    open for writing – an existing file is overwritten
>>   open for appending – the file is created if necessary
+<   open for reading and writing
+>   open for writing and reading – an existing file is overwritten first
If you do not give any mode (as we did in Program 44 on page 240), the
file is opened only for reading.
Of course, you can also use back referencing as described in more detail in
Section 9.2.7 on page 137. Thus, $1 would be the first parenthesized subex-
pression that matched, $2 the second and so on.
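A toy example: the following one-liner extracts the characters between a
(made-up) start and stop codon with back references:

$ perl -e '"ATGAAATAG" =~ /(ATG)(.+)(TAG)/;
> print "start: $1 middle: $2 stop: $3\n"'
start: ATG middle: AAA stop: TAG
$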
The first four escape sequences may be used inside or outside of character
classes ([...]).
In Program 47 the data is read from the end of the file. Each line is concate-
nated to the variable text. The dot in line 3 is the concatenation command.
In line 4 we print out the content of text. Then we search for the pattern
“cc.>”. The modifier s tells perl to treat the whole text as a single line;
thus, newline characters can be matched with the dot character (.). Note
that the caret (^) and the dollar character ($) then still match only at the
very beginning and end of the whole string; if you want them to match at
every line break, use the modifier m. The following list gives an overview of
the most important modifiers.

g  match globally, i.e. find all occurrences
i  match case-insensitively
m  treat the string as multiple lines (^ and $ match at every line break)
s  treat the string as a single line (the dot matches newline characters, too)
x  allow white space and comments in the pattern
As with the match command m, regular expressions are also used with the
substitution command s and the transliteration command tr. These two
commands are discussed in Sections 12.9.1 and 12.9.2, respectively.
12.9 String Manipulations
I guess you are pretty much accustomed to the substitution command because
we have used it a lot with sed and awk already. The syntax is
$var =~ s/start/ATG/g
$var =~ tr/acgt/TGCA/
If the value of the variable var was a DNA sequence, the second command
would complement it and convert it to uppercase. Thus, “atgcgt” would
become “TACGCA”. If you omit the binding operator (=~), the value of the
built-in variable $_ would be changed instead.
The transliteration command comes with two very useful modifiers:

d  delete found but unreplaced characters
s  squash duplicate replaced characters

For example, the following command would replace all a’s with A’s, all b’s
with B ’s, and delete all c’s and d ’s:
tr/abcd/AB/d
Without the modifier d, all c’s and d ’s would be replaced by the last replace-
ment character, here the “B ”.
An example of the application of the s modifier is
tr/A-Z/A-Z/s
Here, each uppercase character is replaced by itself, but runs of identical
characters are squashed into a single character.
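To see the squash modifier in action, try this little one-liner (the input string
is just an example):

$ perl -e '$s="AAACCCGGT"; $s=~tr/A-Z/A-Z/s; print "$s\n"'
ACGT
$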
In this section you will learn how to apply the most frequent text string ma-
nipulation commands. Most functions are homologous to awk. Thus, take a
look at Section 11.8.3 on page 196 for examples.
chomp target
Remove Input Record Separator. – This command is usually used to remove
the newline character at the end of a string. In fact, chomp removes the value
of the input record separator variable $/ (which is by default the newline
character) from the end of a string. The target can be either a scalar or
an array (then all elements are chomped). The variable is modified and the
total number of removed characters is returned.
Program 48: chomp
1 # save as data.pl
2 while (<DATA>){
3 chomp $_;
4 print $_}
5 print "\n";
6 __END__
7 Here you can save your data
8 in many lines
chop target
Remove Last Character. – This function chops off the last character of a scalar.
Applied to an array, the last character of every element is chopped off. chop
modifies the variable and returns the last chopped character.
lc target
Lowercase. – Returns a lowercase version of target, which itself remains untouched.
uc target
Uppercase. – Returns an uppercase version of target, which itself remains untouched.
length target
Length. – Returns the length of the string target.
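A tiny, made-up example combining these commands:

$ perl -e '$s="gattaca\n"; chomp $s;
> print uc($s), " is ", length($s), " bases long\n"'
GATTACA is 7 bases long
$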
With the string manipulation commands presented in this section you can
do basically everything. Remember that you first have to formulate clearly
what your problem is. Then start to look for the right commands and play
around until everything works. Very often, small errors in the use of regular
expressions cause problems. You must be patient and try them out thoroughly.
12.10 Calculations
Needless to say, perl offers some built-in arithmetic functions to play
around with. The most important ones are listed in alphabetical order below.

abs(x)      absolute value of x
atan2(y,x)  arctangent of y/x
cos(x)      cosine of x (x in radians)
exp(x)      e to the power of x
int(x)      integer portion of x
log(x)      natural logarithm of x
rand(x)     random number between 0 and x
sin(x)      sine of x (x in radians)
sqrt(x)     square root of x

You need the common logarithm (base 10) of x? How about your math?
Some years ago? Okay, do not worry, I had to look it up, too... The logarithm
of x to base b is ln(x)/ln(b), where ln is the logarithm to base e. An appropriate
command would be
$x = log($x)/log(10)
It would be nice to have a function for this. This we are going to learn next.
12.11 Subroutines
Imagine your program needs to calculate the common logarithm at many
places. Instead of writing $x = log($x)/log(10) all the time
(see Sect. 12.10 on the page before), we would be better off writing our
own function. We already came across user-defined functions in awk (see
Sect. 11.8.5 on page 201). The concept of user-defined functions is the same
in perl; however, they are called subroutines and the arguments are passed
differently. Thus, the general syntax becomes:
sub name{
my($par1,$par2,...)=@_;
command(s);
return $whatever
}
The subroutine is called with name(par1,par2,...). With the line
my($par1,$par2,...) = @_;
the parameters are collected and saved in the variables par1, par2 and so
on. The command my restricts the usage of these variables to the subroutine.
You should declare all variables used in subroutines with “my $var” before
you use them. You can also declare them while assigning them, like
my $var = 1.5
Once a variable is declared in this fashion, it exists only until the end of the
subroutine. If any variable elsewhere in the program has the same name, you
do not have to worry.
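The following one-liner demonstrates this scoping (the variable names and
values are arbitrary):

$ perl -e '$x="global"; sub show{my $x="local"; print "$x\n"}
> show(); print "$x\n"'
local
global
$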
Okay, now back to practice. In the following example we create a subroutine
that calculates the logarithm of x to the base b.
Program 49: Subroutine
1 # save as sublog.pl
2 # demonstrates subroutines
3 # calculates logarithm
4 print "Calculator for LOG of X to base B\n";
5 print "Enter X: "; $x=<STDIN>;
6 print "Enter B: "; $b=<STDIN>;
7 printf("%s%.4f\n", "The result is ", callog($x,$b));
8
9 sub callog{
10 my($val,$base)=@_;
11 return (log($val)/log($base));
12 }
You should already know the effect of the first six lines of Program 49. What
is new is the call of the subroutine callog at the end of line 7. Two parameters,
the variables x and b, are passed to the subroutine callog. The subroutine itself
spans from line 9 to line 12. In line 10, the parameters are collected and saved
in the variables val and base, respectively. These variables are declared with
my and thus are valid only within the subroutine. The return value of the sub-
routine is the result of the calculation in line 11. Ultimately, “callog($x,$b)”
is substituted by this return value. If necessary, you could also return a list of
variables, i.e. an array. The result is formatted such that it is displayed with
4 decimal digits (“%.4f”; recall Sect. 11.8.1 on page 191 for more details).
Terminal 164 shows the typical output of Program 49.
Terminal 164: Subroutine
1 $ perl sublog.pl
2 Calculator for LOG of X to base B
3 Enter X: 100
4 Enter B: 10
5 The result is 2.0000
6 $
If you need to, you can design your program to call subroutines from within
subroutines. Sometimes, subroutines do not require any parameters; then you
call them with a set of empty parentheses, like name(). You can place your
subroutines wherever you want in your program. However, I advise you to
write them either at the beginning or at the end. This is handier, especially
with large programs.
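A parameterless subroutine might look like this made-up example:

sub greeting{return "Hello Freddy!\n"}
print greeting();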
12.12 Packages and Modules
Program 50 shows how the subroutine callog from Program 49 can be stored
in a package file of its own.
Program 50: A Package File
1 # save as FreddysPackage.pm
2
3 package FreddysPackage;
4
5 # calculates logarithm
6 sub callog{
7 my($val,$base)=@_;
8 return (log($val)/log($base));
9 }
10
11 # a package must return a true value
12 1;
Program 50, which is a package file, contains only two special features: line
3 tells perl that this is a package, and line 12 tells the program which calls the
package that everything worked fine – a package file must end with a true
value. These two lines are necessary to define a package.
Now let us use FreddysPackage.pm.
Program 51: Use A Package
1 # save as require.pl
2 require FreddysPackage;
3 print "Calculator for LOG of X to base B\n";
4 print "Enter X: "; $x=<STDIN>;
5 print "Enter B: "; $b=<STDIN>;
6 printf("%s%.4f\n", "The result is ",
7 FreddysPackage::callog($x,$b));
Program 51 loads the package file with the command require in line 2. Note
that the subroutine now has to be called by its full name, i.e. prefixed with
the package name and two colons (FreddysPackage::callog in line 7). If we
turn the package into a module that exports its subroutines, this prefix
becomes unnecessary.
Program 52: An Exporting Package
1 # save as FreddysPackage.pm
2
3 package FreddysPackage;
4
5 use Exporter;
6 our @ISA=qw(Exporter);
7
8 # export subroutine
9 our @EXPORT=qw(&callog);
10
11 # calculates logarithm
12 sub callog{
13 my($val,$base)=@_;
14 return (log($val)/log($base));
15 }
16
17 1;
The package in Program 52 has more lines than Program 50 on the pre-
ceding page. Actually, I will not go into all details here. Whenever you want
to make the subroutines in a package file easily accessible, add lines 5, 6
and 9. In line 9 you specify, within the parentheses, the subroutines you want
to export, each preceded by the ampersand character (&). If there are more
subroutines, you separate them by spaces, like
our @EXPORT=qw(&sub1 &sub2 &sub3);
While the package has become more complicated, the program becomes sim-
pler.
Program 53: Modules
1 # save as use.pl
2 use FreddysPackage;
3 print "Calculator for LOG of X to base B\n";
4 print "Enter X: "; $x=<STDIN>;
5 print "Enter B: "; $b=<STDIN>;
6 printf("%s%.4f\n", "The result is ",
7 callog($x,$b));
With the knowledge of this section you should be able to construct your
own packages with the functions (modules) you frequently use. One great re-
source for ready-made modules is CPAN (the Comprehensive Perl Archive
Network) at www.cpan.org. Maybe the day will come when you place your
own packages at this site.
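As a little appetizer, the core module List::Util, which is bundled with
current perl distributions, exports some handy list functions. A minimal
sketch (the gene lengths are invented):

use List::Util qw(min max sum);
@len=(312,1506,987);
print "shortest: ", min(@len), "\n";
print "longest: ", max(@len), "\n";
print "total: ", sum(@len), "\n";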
12.13 Bioperl
Bioperl is not a new programming language but a language extension [13].
Officially organized in 1995, the Bioperl Project is an international associa-
tion of developers of open-source perl tools for bioinformatics, genomics and
life science.
12.15 Examples
Let us finish this chapter with some examples.
The following program translates a DNA sequence into an RNA sequence and
computes the reverse complement of the latter.
Program 54: Transform DNA
1 #!/usr/bin/perl -w
2 # save as dna-rna-protein.pl
3 # playing around with sequences
4
5 print "Enter DNA Sequence: ";
6 $DNA=<STDIN>;
7 chomp $DNA;
8 $DNA=uc($DNA);
9 print "DNA Sequence:\n$DNA\n";
10
11 $RNA = $DNA;
12 $RNA=~s/T/U/g;
13 print "RNA Sequence:\n$RNA\n";
14
15 $RevCompl=reverse $RNA;
16 $RevCompl=~tr/ACUG/UGAC/;
17 print "Reverse Complement:\n$RevCompl\n";
You can run the program with perl dna-rna-protein.pl.
In the first line we see that the option -w is activated. This means that perl
prints warnings about possible spelling errors and other error-prone constructs
in the script. In line 5, a message is displayed that informs the user that some
input is required. The DNA sequence coming from standard input (<STDIN>)
is assigned to the variable DNA. In line 7, the newline character is erased with
the command chomp. Then the sequence is made uppercase (uc) and displayed.
In line 11, the sequence saved in DNA is assigned to the variable RNA. Then,
in line 12, all T ’s are substituted by U ’s. The result is printed in line 13.
In line 15, the value of the variable RNA is reversed and assigned to the
variable RevCompl. Next, in line 16, the reversed sequence is complemented
by the transliterate function “tr/.../.../”. This is quite a nice trick to
complement nucleotide sequences. Again, the result is displayed.
The next example computes the GC content of a DNA sequence, i.e. the
percentage of guanines and cytosines.
Program 55: GC Content
1 #!/usr/bin/perl -w
2 # save as gc-content.pl
3 $seq=$ARGV[0];
4 print "Sequence: $seq\n";
5 $gc=gc_content($seq);
6 printf("%s%.2f%s\n", "GC Content: ", $gc, "%");
7
8 sub gc_content{
9 my $seq = shift;
10 $seq=($seq=~tr/gGcC//)/length($seq)*100;
11 return $seq;
12 }
Program 55 takes its input from the command line. The DNA sequence
from the command line is assigned to the variable seq in line 3. After the
sequence has been printed, the variable seq in line 5 is forwarded to the sub-
routine gc_content, which starts at line 8. The command shift in line 9
collects the first element from the array “@_” and assigns it to the local (my)
variable seq. Then, in line 10, a very smart action is performed. Do you rec-
ognize what is going on? – The transliteration function is applied to count all
upper- and lowercase G’s and C ’s; without a replacement list, tr leaves the
string untouched. The number of matches is divided by the length of the
complete sequence and multiplied by 100, thus delivering the portion of
guanines and cytosines. This line is a very good example of the versatility of
perl: since “tr/gGcC//” is situated in a scalar, i.e. arithmetic, context, it
evaluates to the number of matched characters. The result of the whole
calculation is then assigned to seq, returned and saved in gc in line 5. Finally,
the result is printed out with two digits after the decimal point (%.2f).
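Assuming the program was saved as gc-content.pl, a test run might look like
this (the sequence is arbitrary):

$ perl gc-content.pl gattaca
Sequence: gattaca
GC Content: 28.57%
$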
Restriction enzymes recognize and cut DNA. The discovery of restriction en-
zymes laid the basis for the rise of molecular biology. Without the accurate
action of restriction enzymes (or, more precisely, restriction endonucleases) gene cloning
would not be possible. Those restriction enzymes that are widely used in
genetic engineering recognize palindromic nucleotide sequences and cut the
DNA within these sequences. Palindromes are strings that can be read from
both sides, resulting in the same message. Examples are: “Madam I’m Adam”
or “Sex at noon taxes”.
With the following program, we wish to identify all restriction enzyme recog-
nition sites of one or more enzymes. We do not care where the enzyme cuts
(this we leave for the exercises). Furthermore, we assume that a list with re-
striction enzyme names and recognition sites is provided. What do we have
to think of? Well, the most tricky thing will be the output. How shall we
organize it, especially with respect to readability? A good solution would be
to have first the original sequence, followed by the highlighted cutting sites.
Try out the following program.
Program 56: Digestion
1 #!/usr/bin/perl -w
2 # save as digest.pl
3 # Provide Cutter list
4 # Needs input file "cutterinput.seq" with DNA
5 %ENZYMES=(XmaIII => 'cggccg', BamHI => 'ggatcc',
6 XhoI => 'ctcgag', MstI => 'tgcgca');
7 $DNA="";
25 sub cutting{
26 my ($DNA,%ENZYME)=@_;
27 foreach $ENZ (values %ENZYME){
28 @PARTS=split($ENZ,$DNA);
29 $CUT=""; $i=0;
30 while(++$i <= length($ENZ)){$CUT=$CUT.’+’}
31 foreach $NUC (@PARTS){
32 $NUC=~ s/./-/g;
33 }
34 $"=$CUT; push(@OUT,sprintf("%-40s%s", "@PARTS","\n"));
35 }
36 chomp @OUT; return @OUT;
37 }
Just to give you an idea: it took me around two hours to write this program
from scratch. The basic idea was immediately obvious: pattern matching. For-
matting the output and simplifying the code took more than half of the time.
Okay, let us take a look at program 56. The list of restriction enzymes and
their respective cutting sites is saved in a hash variable in lines 5 and 6. The
input sequences will be read from the file cutterinput.seq. This is done in lines
9 to 11. The input sequences are assigned to the variable DNA, after newline
characters have been erased with chomp. In line 13, the sequence is converted
to lowercase and then, in line 14, passed to the subroutine cutting. This sub-
routine is provided with two data sets: the DNA sequence in the variable DNA
and the restriction enzymes plus their cutting sites in the hash variable EN-
ZYMES. The subroutine starts in line 25. First, it collects the forwarded data
and saves them in local variables. Then, for each (foreach in line 27) restric-
tion enzyme recognition site, the DNA sequence is a) cut at the recognition
site (line 28), b) a string consisting of as many “+” characters as there are
nucleotides in the current recognition site is created (line 30), c) all non-cut
nucleotides are converted to “-” characters (line 32) and finally d) the result
is appended (push) to the array OUT with the sprintf command. Note that
the cutting site, which was converted to plus characters, is assigned to the list
separator variable ($"). After this loop has been cycled for all restriction enzymes, the
newline characters are removed from the end of the array OUT (line 36) and
OUT is returned. Now we are back in line 14. The result is saved in the array
OUTPUT. Each element of OUTPUT consists of the cutting pattern of DNA
by a specific restriction enzyme. Furthermore, each element of OUTPUT is
as long as the DNA sequence itself. This is very inconvenient for the final
output: with long sequences, the lines would not fit onto the
screen. Thus, in line 16, we initiate a while loop that takes away portions of
30 characters of DNA (line 18) and OUTPUT (line 21). The embedded while
loop from lines 19 to 22 extracts each element from OUTPUT and adds the
name of the restriction enzyme from the array NAMES, which was assigned
in line 15. Finally, everything is displayed with the printf command ranging
from line 20 to 21.
Terminal 165: Restriction Enzymes
1 $ perl digest2.pl
2 tatcgatgcatcgcatgtcactagcgccgg
3 ------------------------------ XhoI
4 ------------------------------ BamHI
5 ---------------------------+++ XmaIII
6 ------------------------------ MstI
7 ccggtagtgcatcgagctagctaggatccc
8 -----------------------++++++- XhoI
9 ------------------------------ BamHI
10 +++--------------------------- XmaIII
11 ------------------------------ MstI
12 gtcgtcgtcgtgatcgctcgagac
13 ------------------------ XhoI
14 ----------------++++++-- BamHI
15 ------------------------ XmaIII
16 ------------------------ MstI
17 $
This is probably the most difficult example of the whole book. Take your time
and go through the example slowly and with concentration – then you will
master it!
An important task in biology is to measure the difference between DNA or pro-
tein sequences. With a reasonable quantitative measure in hand we can then
start to construct phylogenetic trees. Therefore, each sequence is compared
with every other sequence. The obtained distance measures would be filled into a so-
called distance matrix, which in turn would be the basis for the construction
of the tree. What is a reasonable distance measure? One way is to measure
the dissimilarity between sequences as the number of editing steps to convert
one sequence into the other. This resembles the action of mutations. Allowed
editing operations are deletion, insertion and substitution of single characters
(nucleotides or amino acids) in either sequence. Program 57 on page 261 cal-
culates and returns the minimal number of edit operations required to change
one string into another. The resulting distance measure is called Levenshtein
distance. It is named after the Russian scientist Vladimir Levenshtein, who
devised the algorithm in 1965 [7]. The greater the distance, the more different
the strings (sequences) are. The beauty of this system is that the distance is
intuitively understandable by humans and computers. Apart from biological
sequence comparison there are lots of other applications of the Levenshtein
distance. For example, it is used in some spell checkers to guess which word
from a dictionary is meant when an unknown word is encountered.
How can we measure the Levenshtein distance? The way in which we will solve
the problem to calculate the Levenshtein distance is called dynamic program-
ming. Dynamic Programming refers to a very large class of algorithms. The
basic idea is to break down a large problem into incremental steps so that, at
any given stage, subproblems with subsolutions are obtained. Let us assume
we have two sequences: ACGCTT (sequence 1) and AGCGT (sequence 2).
The algorithm is based on a two-dimensional matrix (array) as illustrated in
Fig. 12.1 on the next page.
The array rows are indexed by the characters of sequence 1 and the
columns are indexed by the characters of sequence 2. Each cell of the ar-
ray contains the number of editing steps needed to convert sequence 1 to
sequence 2 at the actual cell position. In other words: each cell [row, col] (row
and col being the row and column index, respectively) represents the mini-
mal distance between the first row characters of sequence 1 and the first col
characters of sequence 2. Thus, cell [2, 2] contains the number of editing steps
needed to change AC to AG, which is 1 (convert C to G), and cell [2, 4] con-
tains the number of steps needed to convert AC to AGCG, which is 2 (insert
the two G’s).
How do we obtain these numbers? During an initialization step column 0 is
filled with numbers ranging from 1 to the length of sequence 1 (dark shad-
owed in Fig. 12.1). This corresponds to the number of insertions needed if
sequence 2 was zero characters long. Equally, row 0 is filled with numbers
ranging from 1 to the length of sequence 2.
Program 57: Levenshtein Distance
1 #!/usr/bin/perl -w
2 # save as levenshtein.pl
3 # calculates the Levenshtein distance
4 # of the two sequences given as arguments
5
6 $one=$ARGV[0]; $two=$ARGV[1];
7 print "$one <=> $two\n";
8 print "Levenshtein Distance: ",distance($one, $two), "\n";
9
10 sub distance {
11 ($a,$b)=@_;
12 $la=length($a); $lb=length($b);
13 if(!$la) {$result=$lb;return $result}
14 if(!$lb) {$result=$la;return $result}
15 foreach $row (1 .. $la) {$m[$row][0]=$row}
16 foreach $col (1 .. $lb) {$m[0][$col]=$col}
17 foreach $row (1 .. $la) {
18 $a_i=substr($a,$row-1,1);
19 foreach $col (1 .. $lb) {
20 $b_i=substr($b,$col-1,1);
21 if ($a_i eq $b_i) {
22 $cost=0 # cost for match
23 } else {
24 $cost=1 # cost for mismatch
25 }
26 $m[$row][$col]=min($m[$row-1][$col]+1,
27 $m[$row][$col-1]+1,
28 $m[$row-1][$col-1]+$cost);
29 }
30 }
31 return $m[$la][$lb];
32 }
33 sub min {
34 my($a,$b,$c)=@_;
35 $result=$a;
36 if ($b < $result) {$result=$b};
37 if ($c < $result) {$result=$c};
38 return $result
39 }
In line 6 the two command line arguments are assigned to the variables one
and two. Line 8 prints the Levenshtein distance returned by the subroutine
distance. Within the subroutine, lines 13 and 14 handle the special case that
one of the strings is empty. In lines
15 and 16 the first row and column are initialized (see Fig. 12.1 on page 260).
Take a look at Section 12.4.8 on page 236 to recall the syntax of the foreach
construct. The foreach loops in lines 17 and 19 step through each row and
column of the two-dimensional matrix $m, respectively. In lines 18 and 20 the
characters corresponding to the actual cell are extracted from the sequences
saved in $a and $b. If these characters match (checked in line 21), then $cost is
set to 0, or else to 1. The construct spanning from line 26 to 28 assigns a value
to the current cell [row, col] saved in the array $m. Therefore, the subroutine
min is applied. This subroutine, spanning from lines 33 to 39, simply returns
the smallest of three numbers. What is happening in lines 26 to 28? The cost
for an insertion or deletion is always 1. Thus, 1 is added to value above the
current cell ($col-1 ) and left of the current cell ($row-1 ). Depending on the
value of $cost, 1 or 0 is added to the cell diagonally above and to the left of
the current cell (row-1,col-1 ). The minimum of these 3 numbers is assigned
to the current cell. In this way the program surfs through the matrix. The
last cell to which a value is assigned is the bottom right one (see Fig. 12.1 on
page 260). Its value corresponds to the Levenshtein distance of both strings
and is the return value (line 31) of the subroutine distance. Execution of the
program is shown in Terminal 166.
Terminal 166: Levenshtein Distance
1 $ perl levenshtein.pl atgctatgtcgtgg tcatcgtacgtacg
2 atgctatgtcgtgg <=> tcatcgtacgtacg
3 Levenshtein Distance: 7
4 $
How about the complexity of our algorithm? Well, assuming that the
length of each sequence is n, the running time as well as the memory de-
mand for the matrix grows with n².
Finally, a little extension for our program that should help you to understand
calculating the Levenshtein distance. Add
print "\n";
to line 17 and
printf ("%3s", $m[$row][$col]);
to line 28, respectively. Then the program prints out the matrix you know
from Fig. 12.1 on page 260 as shown in the following Terminal.
Terminal 167: The Matrix
1 $ perl levenshtein2.pl att aggtgt
2 att <=> aggtgt
3
4 0 1 2 3 4 5
5 1 1 2 2 3 4
6 2 2 2 2 3 3Levenshtein Distance: 3
7 $
I hope you liked our excursion to dynamic programming and all the other
examples! I know it is sometimes a hard job to read other people’s programs,
especially with perl; but take your time and go through the examples. This
is the best way to learn perl. Learning is largely a process of imitating. Only
when you can reproduce something can you start to modify and develop your
own stuff.
Exercises
The best exercise is to apply perl in your daily life. However, to start with,
I add some exercises referring to the examples in Section 12.15 on page 254.
12.1. Expand Program 54 on page 255 such that it translates the original
DNA sequence into the corresponding protein sequence. Use the standard
genetic code.
12.2. Expand Program 54 on page 255 such that it translates both the orig-
inal and the complemented into the corresponding protein sequence. Use the
standard genetic code.
12.3. Expand Program 54 on page 255 such that it can load several sequences
in fasta format from a file and converts them.
12.6. Modify Program 56 on page 257 such that the name of the input file
can be given at the command line level.
12.7. Modify Program 56 on page 257 such that the position in terms of
nucleotides is displayed at the beginning and end of each line.
12.8. Modify Program 56 on page 257 such that the list of restriction enzymes
and their recognition sites is imported from a file.
12.9. Unfortunately, Program 56 on page 257 has a little bug, visible in Ter-
minal 165 on page 258: if the restriction enzyme recognition site lies at the
end of the DNA sequence, the last nucleotide is not recognized. Try it out!
Find a way to cure the problem.
A
Appendix
In this chapter you will find a number of practical hints for the advanced
use of Unix/Linux.
A.5 Devices
The following list shows you the path of some more or less special devices.
Unix/Linux regards all hardware as devices. They can all be found in the sys-
tem folder /dev. Remember that Unix/Linux treats everything as files, even
hardware devices.
I must warn you: depending on the Unix/Linux installation you are work-
ing with, these paths might be different.

/dev/fd0     floppy disk drive
/dev/cdrom   CD-ROM drive
/dev/hda     first IDE hard disk
/dev/sda     first SCSI disk
/dev/tty     the current terminal
/dev/null    the data trash can
Fig. A.1. In this example the CD-ROM drive is mounted on the directory cdrom
in Freddy’s home directory. Usually, only the superuser (root) is allowed to execute
the mount command
A.6 Mounting Filesystems
We have already seen in Section A.5 on the preceding page that CD-ROM
drives and so on are devices. From the Windows operating system you are
used to putting an external memory medium like a floppy disk or CD-ROM
into the drive and accessing it via e.g. the file manager. Depending on the
settings of the Unix/Linux system you are working with, using external
memory media can be more difficult. By default, Unix/Linux is protected
and you can access external memory media only if the system administrator
has given you the required permissions.
When you mount a filesystem, you should know what kind of filesystem it is.
There are several different ones around:

ext2, ext3   standard Linux filesystems
iso9660      the filesystem of CD-ROMs
vfat         Windows 95/98/ME filesystem
ntfs         Windows NT/2000/XP filesystem
nfs          network filesystem
In order to check which filesystems and partitions are available, use the
df command. This command reports the filesystem disk space usage.
Terminal 168: Filesystem
1 $ df -Th
2 Filesystem Type Size Used Avail Use% Mounted on
3 /dev/hda2 ext3 18G 3.1G 13G 18% /
4 /dev/hda1 ext3 99M 14M 79M 15% /boot
5 none tmpfs 93M 4.0K 93M 1% /dev/shm
6 $
In Terminal 168 we use the df command with the options -T and -h, which
add the filesystem-type information and make the output human-readable,
respectively.
Now, let us see how you can add a filesystem like a floppy disk or CD-ROM
drive.
A.6.1 Floppy Disk
The floppy disk drive is added to the existing filesystem with the command
mount. Usually, you must be superuser (root) in order to be able to use mount.
If you are working on your own system, you should have the root password.
You can always login as root with the command su.
Terminal 169: Mount Floppy Disk
1 $ cat /etc/passwd | grep Freddy
2 Freddy:x:502:502::/home/Freddy:/bin/bash
3 $ echo $UID
4 502
5 $ cd
6 $ mkdir floppy
7 $ su
8 Password:
9 # mount -t auto -o uid=502 /dev/fd0 /home/Freddy/floppy/
10 # exit
11 exit
12 $ ...
13 $ cd
14 $ su
15 Password:
16 # umount /home/Freddy/floppy
17 # exit
18 exit
19 $
A.6.2 CD-ROM
Mounting the CD-ROM drive is pretty much like mounting a floppy-disk drive
(see Sect. A.6.1 on the page before).
Terminal 170: Mount CD-ROM
1 $ cd
2 $ mkdir cdrom
3 $ su
4 Password:
5 # mount -t iso9660 /dev/cdrom /home/Freddy/cdrom
6 # exit
7 exit
8 $ ...
9 $ cd
10 $ su
11 Password:
12 # umount /home/Freddy/cdrom
13 # exit
14 exit
15 $
A.6.3 Compact Flash Cards
It is very common nowadays to have compact flash cards or other small
memory media. I have had bad experiences using USB adapters. However,
with a PCMCIA adapter I have no trouble mounting my compact flash card
with the mount command.
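A typical invocation might look like the following line; the filesystem type,
device name and mount point are only examples and will differ on your
system:

# mount -t vfat /dev/sda1 /mnt/flash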
Of course, you are welcome to learn the codes; however, you might prefer to
look them up here...
It is quite important that you and I speak the same language when it comes
to special characters like parentheses, brackets and braces. Here is a list for
your orientation:
    Space              blank
#   Crosshatch         number sign, sharp, hash
$   Dollar Sign        dollar, cash, currency symbol, string
%   Percent Sign       percent, grapes
&   Ampersand          and, amper, snowman, daemon
*   Asterisk           star, spider, times, wildcard, pine cone
,   Comma              tail
.   Period             dot, decimal (point), full stop
:   Colon              two-spot, double dot, dots
;   Semicolon          semi, hybrid
<>  Angle Brackets     angles, funnels
<   Less Than          less, read from
>   Greater Than       more, write to
=   Equal Sign         equal(s)
+   Plus Sign          plus, add, cross, and, intersection
-   Dash               minus (sign), hyphen, negative (sign)
!   Exclamation Point  exclamation (mark), (ex)clam
?   Question Mark      question, query, wildchar
@   Vortex             at, each, monkey (tail)
()  Parentheses        parens, round brackets, bananas
(   Left Parenthesis   open paren, wane, parenthesee
)   Right Parenthesis  close paren, wax, unparenthesee
[]  Brackets           square brackets, edged parentheses
[   Left Bracket       bracket, left square bracket, opensquare
]   Right Bracket      unbracket, right square bracket, unsquare
{}  Braces             curly braces
{   Left Brace         brace, curly, leftit, embrace, openbrace
}   Right Brace        unbrace, uncurly, rytit, bracelet, close
/   Slash              stroke, diagonal, divided-by, forward slash
\   Backslash          bash, (back)slant, escape, blash
^   Circumflex         caret, top hat, cap, uphat, power
"   Double Quotes      quotation marks, literal mark, rabbit ears
'   Single Quote       apostrophe, tick, prime
`   Grave              accent, back/left/open quote, backprime
~   Tilde              twiddle, wave, swung dash, approx
_   Underscore         underline, underbar, under, blank
|   Vertical Bar       pipe to, vertical line, broken line, bar
Solutions
Before you peek in here, try to solve the problems yourself. In some cases,
especially for the programming exercises, you might need to spend an hour
or so. Some scripts behave strangely if input files are in DOS format. In
these cases use the dos2unix filename command to convert the file (see
Sect. A.4 on page 267).
Solutions to Chapter 3
3.1 Take a look at Sect. 3.1 on page 27.
3.2 Take a look at Sect. 3.2.2 on page 32.
3.3 Type date > the_date, then date >> the_date and finally cat the_date
3.4 Use passwd
3.5 Use: exit or logout or Ctrl+D
Solutions to Chapter 4
4.1 Use: cd or cd ~; pwd; ls -a; ls -al
4.2 Use: pwd; cd ..; cd /; cd or cd ~
4.3 Use: cd or cd ~; mkdir testdir; mkdir testdir/subdir;
ls -a testdir/subdir; rm -r testdir/subdir
4.4 Amazing how many directories there are, isn’t it?
4.5 Use: ls -l /; ls -l /etc/passwd; probably you have all rights on /etc
and read-only right on /etc/passwd ; cat /etc/passwd
Solutions to Chapter 5
5.1 Use the same commands described in Sect. 5.1 on page 53.
5.2 Use: gunzip tacg-3.50-src.tar.gz; tar xf tacg-3.50-src.tar
5.3 Use: ./configure and then make -j2
5.4 Use cat or less for file viewing (see Chap. 6 on page 63).
Solutions to Chapter 6
6.1 Use: cat > fruits.txt and cat >> fruits.txt. Stop input with Ctrl+D.
6.2 Use: cat > vegetable; cat fruits.txt vegetable > dinner
6.3 Use: sort dinner
6.4 Take a look at Sect. 6.3 on page 72.
Solutions to Chapter 7
7.1 Use: chsh and later enter /bin/bash
7.2 Use: ps | sort +4 -n | tee filename.txt
Solutions to Chapter 8
8.1 Hey, do not waste your time – play with the code!
Solutions to Chapter 9
9.1 I cannot help you here...
9.2 Use: egrep '^[^ ]+ [^ ]+ [^ ]+$' file.txt
9.3 Depending on your system use: egrep -e '-[0-9]+' file.txt or egrep
-e '\-[0-9]+' file.txt Note the use of the '-e' switch. Without it, egrep
interprets the leading minus character in the regular expression as a switch
indicator.
9.4 Use: egrep '-?([0-9]+\.?[0-9]*|[0-9]*\.[0-9]+)' file.txt Here the
minus is okay because it is preceded by a space character.
9.5 Use: egrep 'ATG[ATGC]{20,}TAA' seq.file
9.6 Use: egrep 'hydrogenase' file.txt
9.7 Use: egrep 'GT.?TACTAAC.?AG' seq.file
9.8 Use: egrep 'G[RT]VQGVGFR.{13}[DW]V[CN]N{3}G' seq.file Since the N
stands for any amino acid, you can replace all N's by [GPAVLIMCFYWHKRQNEDST]
9.9 Use: ls -l | egrep '^.{7}r' or ls -l | egrep '^.......r'
Solutions to Chapter 10
10.1 Use: sed 's/Beisel/Weisel/' structure.pdb
10.2 Use: sed '1,3d' structure.pdb
10.3 Use: sed -n '5,10p' structure.pdb or sed -e '1,4d' -e '11,$d'
structure.pdb
10.4 Use: sed '/MET/d' structure.pdb
10.5 Use: sed -n '/HELIX.*ILE/p' structure.pdb
10.6 Use: sed '/^H/s/$/* * */' structure.pdb
Solutions to Chapter 11
11.1 Use: awk 'BEGIN{FS=","}{for (i=NF; i>0; i--){out=out$i" - "}
print out}' numbers.txt
11.2
BEGIN{RS=";"; i=1
while (getline < "words.txt" >0){word[i]=$1; i++}
RS=","}
{print $1,":",word[NR]}
11.3
BEGIN{n=-1
for(i=1; i<51; i++){
n++; if(n==10){print ""; n=0}; printf("%2s ",i)}
print ""}
11.4 Use: awk '{print $0, "-", $5/$8, "bp/gene"}' genomes2.txt
11.5 Add to line 14: list=list" "key1; replace in line 16 key1 != key2
with index(list, key2)==0; less calculation time is needed
Solutions to Chapter 12
12.1 Add the following lines at the end of the program:
print "Translated Sequence:\n";
while (length(substr($DNA,0,3)) == 3){
$tri=substr($DNA,0,3,"");
print aa($tri)}
sub aa{
my($codon)=@_;
if ($codon =~ /GC[ATGC]/) {return "A"} # Ala
elsif ($codon =~ /TG[TC]/) {return "C"} # Cys
elsif ($codon =~ /GA[TC]/) {return "D"} # Asp
elsif ($codon =~ /GA[AG]/) {return "E"} # Glu
elsif ($codon =~ /TT[TC]/) {return "F"} # Phe
elsif ($codon =~ /GG[ATGC]/) {return "G"} # Gly
elsif ($codon =~ /CA[TC]/) {return "H"} # His
elsif ($codon =~ /AT[TCA]/) {return "I"} # Ile
elsif ($codon =~ /AA[AG]/) {return "K"} # Lys