Genomics Sequencing Bioinformatics
Africa Course 2023
Introduction to Linux
Session 2 – Part 1 – SED
Current Attribution: https://github.com/WCSCourses/GSBAfrica2023
Original Attribution: https://github.com/WTAC-NGS
Stream Editor (SED)
• Looks for patterns one line at a time, like grep
• Can change lines of a file, e.g. individual characters or patterns
• Can print selected lines of a file, either to check the content or to pass them to another program
• Very useful for finding, substituting and formatting text and files, e.g. FASTA headers
• Non-interactive text editor: editing commands are supplied as a script
• A Unix filter that covers much of what the previously mentioned tools do
Basic SED syntax
sed OPTIONS... [SCRIPT] [INPUTFILE...]
sed 's/pattern to find/replacement text/' input_file
The s at the start of the script stands for substitution
E.g., sed 's/chr/Chromosome/' practical/Notebooks/awk/genes.gff
sed has a couple of default ways of working:
sed reads the file line by line and looks for matches of the pattern in each line
The output is sent to standard output (the screen) line by line
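The substitution above can be tried without the course directory. This is a minimal sketch that assumes a small stand-in file, demo_genes.gff (a hypothetical name), built from the first two rows of genes.gff. The \t escapes in printf produce real tab characters.

```shell
# Build a two-line, tab-delimited stand-in for genes.gff (hypothetical file name).
printf 'chr1\tsource1\tgene\t100\t300\t0.5\t+\t0\tname=gene1;product=unknown\n' > demo_genes.gff
printf 'chr1\tsource2\tgene\t1000\t1100\t0.9\t-\t0\tname=recA;product=RecA protein\n' >> demo_genes.gff

# Replace the first "chr" on every line; the result goes to the screen.
sed 's/chr/Chromosome/' demo_genes.gff

# Compare the first column of the sed output with the first column of the input file.
substituted=$(sed 's/chr/Chromosome/' demo_genes.gff | cut -f1 | sort -u)
original=$(cut -f1 demo_genes.gff | sort -u)
echo "sed output: $substituted, input file: $original"
```

The input file still contains chr1 afterwards: sed only changes the stream it prints, not the file it reads.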
Commonly Used SED Commands
Let's take a look at some of the commonly used SED commands:
• s: Substitute command. Used for find and replace. Syntax: s/pattern/replacement/
• d: Delete command. Deletes lines from the input.
• p: Print command. Prints lines from the input.
• i: Insert command. Inserts text before a line.
• a: Append command. Appends text after a line.
• y: Transliterate command. Changes characters in a given set to another set.
• q: Quit command. Exits SED after processing a specific line.
• =: Displays the current line number.
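The commands listed above can be tried one at a time on a throwaway file. This is a sketch only: demo_numbers.txt is a hypothetical scratch file, and the one-line forms of i and a used here are GNU sed extensions, so other sed versions may need the text on a separate line after a backslash.

```shell
# Five numbered lines to experiment on (hypothetical scratch file).
seq 1 5 > demo_numbers.txt

deleted=$(sed '2d' demo_numbers.txt)            # d: drop line 2
printed=$(sed -n '2p' demo_numbers.txt)         # p: print only line 2 (-n silences the rest)
translit=$(echo 'acgt' | sed 'y/acgt/ACGT/')    # y: map each character to its partner
quit=$(sed '3q' demo_numbers.txt)               # q: stop after line 3
last_no=$(sed -n '$=' demo_numbers.txt)         # =: print the number of the last line
inserted=$(sed '1i header' demo_numbers.txt)    # i: put "header" before line 1 (GNU form)
appended=$(sed '$a footer' demo_numbers.txt)    # a: put "footer" after the last line (GNU form)

echo "$deleted"
echo "$translit"
echo "number of lines: $last_no"
```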
Basic SED syntax
● The modified stream of characters can be redirected from the screen to a file for later use
● One generally redirects the output to a new file, leaving the input file unchanged
● sed 's/pattern to find/replacement text/' input_file > output_file
● E.g., sed 's/chr/Chromosome/' practical/Notebooks/awk/genes.gff > sed_output_genes.gff
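A small sketch of the redirection, using hypothetical file names (demo_in.gff and demo_out.gff) so that it runs outside the course directory:

```shell
# Two-line, tab-delimited input (hypothetical file name).
printf 'chr1\tgene\nchr2\trepeat\n' > demo_in.gff

# With > the edited stream is written to a new file and nothing appears on the screen.
sed 's/chr/Chromosome/' demo_in.gff > demo_out.gff

cat demo_out.gff    # the edited copy
cat demo_in.gff     # the original, unchanged
```

The output file should have a different name from the input file: with `sed ... file > file` the shell empties the file before sed has read it.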
Basic SED syntax
● Say I need to convert the genes.gff file from its current tab-delimited format to a comma-separated one for another program
● sed 's/\t/,/' practical/Notebooks/awk/genes.gff
chr1,source1 gene 100 300 0.5 + 0
name=gene1;product=unknown
chr1,source2 gene 1000 1100 0.9 - 0
name=recA;product=RecA protein
chr1,source5 repeat 10000 14000 1 + .
name=ALU
Basic SED syntax
● Only the first tab was substituted by a comma, and not the rest of the tabs
● Why?
Basic SED syntax
● Recall that sed reads the input line by line and, by default, acts on the first match of the pattern in each line
● In the sed 's/chr/Chromosome/' example, chr appears once on each line
● In the sed 's/\t/,/' example, the tab character appears multiple times on each line
● sed's default behaviour is to substitute only the first match on each line
Let's see some practical examples of using SED:
● sed 's/old_text/new_text/' input.txt (replaces the first old_text on each line)
● sed '/pattern/d' input.txt (deletes every line that matches pattern)
● sed -n '5,10p' input.txt (prints lines 5 to 10 only)
● sed -f script.sed filename (runs the commands stored in a script file)
● sed -e 's/old/new/g' -e '/pattern/d' filename (combines several commands)
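The -e and -f forms are two ways of giving sed the same commands. The sketch below assumes two hypothetical files, demo_input.txt and demo_script.sed, and shows that both forms give identical output.

```shell
# Three lines, one of them a comment (hypothetical input file).
printf 'old value\n# comment line\nold and old again\n' > demo_input.txt

# Two commands given on the command line, each introduced by -e.
with_e=$(sed -e 's/old/new/g' -e '/^#/d' demo_input.txt)

# The same two commands stored one per line in a script file and run with -f.
printf 's/old/new/g\n/^#/d\n' > demo_script.sed
with_f=$(sed -f demo_script.sed demo_input.txt)

echo "$with_e"
echo "$with_f"
```

A script file is convenient once the list of commands grows or has to be reused on many files.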
Basic SED syntax
● What if we want to replace all the matches to the pattern, regardless of the number of times it appears in a line?
● Use the global flag, g, at the end of the substitution
● E.g., sed 's/\t/,/g' practical/Notebooks/awk/genes.gff
chr1,source1,gene,100,300,0.5,+,0,name=gene1;product=unknown
chr1,source2,gene,1000,1100,0.9,-,0,name=recA;product=RecA protein
chr1,source5,repeat,10000,14000,1,+,.,name=ALU
chr2,source2,gene,10000,1200,0.95,+,0
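The effect of the flag can be checked on a single record. This sketch assumes GNU sed, which understands \t in a pattern (other sed versions may need a literal tab character). The numeric flag in the last command is an addition to what the slides cover: it replaces only the occurrence with that number.

```shell
# One tab-delimited record held in a shell variable.
line=$(printf 'chr1\tsource1\tgene\t100')

first_only=$(printf '%s\n' "$line" | sed 's/\t/,/')     # no flag: first tab only
every_tab=$(printf '%s\n' "$line" | sed 's/\t/,/g')     # g flag: every tab
second_only=$(printf '%s\n' "$line" | sed 's/\t/,/2')   # numeric flag: second tab only

echo "$first_only"
echo "$every_tab"
echo "$second_only"
```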
Counting and extracting lines with SED
● One can use sed to print specific lines (rows) of a file
● Say I wanted to extract only lines 1, 2 and 3 from the genes.gff file
● sed '1,3p' practical/Notebooks/awk/genes.gff
● Notice that the s command and its slashes are not present, as we are just extracting lines, not matching and modifying any characters
Counting and extracting lines with SED
sed '1,3p' practical/Notebooks/awk/genes.gff
chr1 source1 gene 100 300 0.5 + 0
name=gene1;product=unknown
chr1 source1 gene 100 300 0.5 + 0
name=gene1;product=unknown
chr1 source2 gene 1000 1100 0.9 - 0
name=recA;product=RecA protein
chr1 source2 gene 1000 1100 0.9 - 0
name=recA;product=RecA protein
chr1 source5 repeat 10000 14000 1 + .
name=ALU
chr1 source5 repeat 10000 14000 1 + .
name=ALU
chr2 source2 gene 10000 1200 0.95 + 0
chr2 source1 gene 50 900 0.4 - 0
name=gene2;product=gene2 protein
chr3 source1 gene 200 210 0.8 . 0
name=gene3
chr4 source3 repeat 300 400 1 + .
name=ALU
chr10 source2 repeat 60 70 0.78 + .
name=LINE1
chr10 source2 repeat 150 166 0.84 + .
name=LINE2
chrX source1 gene 123 456 0.6 + 0
name=gene4;product=unknown
Counting and extracting lines with SED
● In this case sed printed out the whole file, with lines 1 to 3 each printed twice
● Why?
Counting and extracting lines with SED
● Recall sed's default behaviour is to print every line to the screen; the p command then prints lines 1 to 3 a second time
● We can use the -n option to switch off this default printing, so that only the lines requested with p appear
● E.g., sed -n '1,3p' practical/Notebooks/awk/genes.gff
chr1 source1 gene 100 300 0.5 + 0
name=gene1;product=unknown
chr1 source2 gene 1000 1100 0.9 - 0
name=recA;product=RecA protein
chr1 source5 repeat 10000 14000 1 + . name=ALU
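Counting output lines makes the effect of -n easy to see. This is a minimal sketch on a hypothetical 10-line file, demo_lines.txt, made with seq:

```shell
# A 10-line file containing the numbers 1 to 10 (hypothetical scratch file).
seq 1 10 > demo_lines.txt

without_n=$(sed '1,3p' demo_lines.txt | wc -l)     # all 10 lines plus 3 repeats
with_n=$(sed -n '1,3p' demo_lines.txt | wc -l)     # only the 3 requested lines

echo "without -n: $without_n lines, with -n: $with_n lines"
```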
Counting and extracting lines with SED
● sed -n '1,3p' practical/Notebooks/awk/genes.gff prints out a range of lines, from 1 to 3
● How do I get it to print out specific lines, e.g. 1-3 and then 5 and 7?
● sed -n '1,3p; 5p; 7p' practical/Notebooks/awk/genes.gff
chr1 source1 gene 100 300 0.5 + 0
name=gene1;product=unknown
chr1 source2 gene 1000 1100 0.9 - 0
name=recA;product=RecA protein
chr1 source5 repeat 10000 14000 1 + .
name=ALU
chr2 source1 gene 50 900 0.4 - 0
name=gene2;product=gene2 protein
chr4 source3 repeat 300 400 1 + .
name=ALU
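The same line selection can be tried on numbered input, where it is obvious which lines came out. In this sketch the use of $ as an address, meaning the last line, goes slightly beyond what the slides show.

```shell
# seq prints the numbers 1 to 10, one per line, so each line's content is its line number.
picked=$(seq 1 10 | sed -n '1,3p; 5p; 7p')    # lines 1-3, then 5, then 7
last=$(seq 1 10 | sed -n '$p')                # $ as an address means the last line
from_eight=$(seq 1 10 | sed -n '8,$p')        # line 8 through to the end

echo "$picked"
echo "last line: $last"
echo "$from_eight"
```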
Special characters for SED
● I mainly use sed for pattern matching, substitution and formatting of files
● sed provides a number of special characters that give more control over its pattern matching:
● ^ matches the start of the line
● $ matches the end of the line
● [a-z] matches any lower-case letter; combined with \U& (upper case) or \L& (lower case) in the replacement, it changes case (\U and \L are GNU sed extensions; the Unix command tr can also change case)
● sed -n 'p;n' prints the odd-numbered lines
● sed -n 'n;p' prints the even-numbered lines
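A short sketch of the odd/even idiom and of case changing, run on generated input. The \U and \L forms assume GNU sed; tr is shown next to them as the portable alternative.

```shell
# n reads the next input line into the buffer; with -n a line is printed only when p runs.
odd=$(seq 1 6 | sed -n 'p;n')     # print a line, then skip the next one
even=$(seq 1 6 | sed -n 'n;p')    # skip a line, then print the next one

# & stands for the matched text; \U and \L change its case (GNU sed only).
upper_sed=$(echo 'chr1 gene' | sed 's/[a-z]/\U&/g')
lower_sed=$(echo 'CHR1 GENE' | sed 's/[A-Z]/\L&/g')
upper_tr=$(echo 'chr1 gene' | tr 'a-z' 'A-Z')    # same result with tr

echo "$odd"
echo "$even"
echo "$upper_sed / $lower_sed / $upper_tr"
```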
Special characters for SED
● sed 's/^/Organism_/g' genes.gff
Organism_chr1 source1 gene 100 300 0.5 + 0
name=gene1;product=unknown
Organism_chr1 source2 gene 1000 1100 0.9 - 0
name=recA;product=RecA protein
● sed 's/$/_Organism/g' genes.gff
chr1 source1 gene 100 300 0.5 + 0
name=gene1;product=unknown_Organism
chr1 source2 gene 1000 1100 0.9 - 0
name=recA;product=RecA protein_Organism
● sed 's/[a-z]/\U&/g' genes.gff
CHR1 SOURCE1 GENE 100 300 0.5 + 0
NAME=GENE1;PRODUCT=UNKNOWN
CHR1 SOURCE2 GENE 1000 1100 0.9 - 0
NAME=RECA;PRODUCT=RECA PROTEIN
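The ^ and $ anchors are also what makes FASTA header editing work. The sketch below is illustrative only: demo.fasta, the sampleA_ prefix and the organism=unknown suffix are hypothetical, and the /^>/ address in front of each s command restricts it to header lines.

```shell
# A two-record FASTA file (hypothetical content).
printf '>seq1\nACGT\n>seq2\nTTGA\n' > demo.fasta

# On header lines only: add a prefix after the leading ">" and a suffix at the end of the line.
sed -e '/^>/ s/^>/>sampleA_/' -e '/^>/ s/$/ organism=unknown/' demo.fasta > demo_renamed.fasta

cat demo_renamed.fasta
```

Sequence lines do not start with >, so they pass through unchanged.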
Special characters for SED
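● For more specific control, include the text to be matched in the pattern, e.g.
● sed 's/^chr*/Organism_/g' genes.gff
Organism_1 source1 gene 100 300 0.5 + 0
name=gene1;product=unknown
Organism_1 source2 gene 1000 1100 0.9 - 0
name=recA;product=RecA protein
● The matched text is replaced, so chr1 becomes Organism_1; in a sed pattern * means "zero or more of the preceding character" (here r), not "any text"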
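How the anchor, the * and the & interact can be checked on a single line. This is a sketch on a made-up record (the trailing word mitochr is hypothetical, added so that chr also occurs away from the start of the line):

```shell
# A record with "chr" at the start and again at the very end (hypothetical).
record='chr1 source1 gene mitochr'

anchored=$(echo "$record" | sed 's/^chr/Organism_/')      # only the leading chr is replaced
kept=$(echo "$record" | sed 's/^chr/Organism_&/')         # & puts the matched chr back
unanchored=$(echo "$record" | sed 's/chr/Organism_/g')    # without ^, every chr is replaced
star=$(echo 'chrrr1' | sed 's/^chr*/Organism_/')          # r* matches zero or more r characters

echo "$anchored"
echo "$kept"
echo "$unanchored"
echo "$star"
```

Using & in the replacement is the way to add a prefix while keeping the original chr.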
More info and examples on using SED (syntax and usage might differ)
● https://bioinformaticsworkbook.org/Appendix/Unix/unix-basics4sed.html#gsc.tab=0
● https://dasher.wustl.edu/chem478/software/unix-tools/sed.html
● https://www.grymoire.com/Unix/Sed.html
● https://gist.github.com/ssstonebraker/6140154