Module 1 Session 2 Part 1 Linux

The document provides an introduction to the Stream Editor (SED) for text processing in bioinformatics, particularly for genomic data. It covers basic syntax, commonly used commands, and practical examples of using SED for tasks such as substitution, deletion, and line extraction. Additionally, it discusses special characters for enhanced pattern matching and provides resources for further learning about SED.

Uploaded by

jackson.sembera

Genomics Sequencing Bioinformatics

Africa Course 2023

Introduction to Linux
Session 2 – Part 1 – SED

Current Attribution:https://github.com/WCSCourses/GSBAfrica2023
Original Attribution: https://github.com/WTAC-NGS
Stream Editor (SED)

• Useful for looking for patterns one line at a time, like grep
• Can be used to change lines of a file, e.g., some characters or patterns
• Useful for printing certain lines of a file, which can then be passed to another software program or used to check content
• SED is very useful for finding, substituting and formatting text and files, e.g., fasta headers
• Non-interactive text editor: editing commands are supplied as a script
• A Unix filter, i.e., it reads a stream of text, transforms it and writes the result out; it can do much of what the previously mentioned tools do

Basic SED syntax

sed OPTIONS... [SCRIPT] [INPUTFILE...]

● sed 's/pattern to find/pattern to replace it with/' input_file
● The s at the start of the script is the substitute command
● E.g., sed 's/chr/Chromosome/' practical/Notebooks/awk/genes.gff
● sed has a couple of default ways of working:
● sed reads the file line by line and looks for matches of the pattern on each line
● The output is sent to standard output (the screen), line by line
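The defaults described above can be tried on a small file. The following is a minimal sketch, assuming a hypothetical three-line, tab-delimited file (sample.gff, created in a temporary directory) that only imitates the layout of genes.gff; the variable names are illustrative.

```shell
# Hypothetical three-line, tab-delimited sample modelled on genes.gff.
tmpdir=$(mktemp -d)
printf 'chr1\tsource1\tgene\nchr2\tsource2\tgene\nchrX\tsource1\tgene\n' > "$tmpdir/sample.gff"

# Replace the first "chr" on each line; the result goes to standard output.
subst_out=$(sed 's/chr/Chromosome/' "$tmpdir/sample.gff")
printf '%s\n' "$subst_out"

# The input file itself is left unchanged by sed.
first_line=$(head -n 1 "$tmpdir/sample.gff")
printf '%s\n' "$first_line"

rm -rf "$tmpdir"
```

Because sed only writes to standard output here, the original file still starts with chr1 after the command has run.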

Commonly Used SED Commands
Let's take a look at some of the commonly used SED commands:
• s: Substitute command. Used for find and replace. Syntax: s/pattern/replacement/.
• d: Delete command. Deletes lines from the input.
• p: Print command. Prints lines from the input.
• i: Insert command. Inserts text before a line.
• a: Append command. Appends text after a line.
• y: Transliterate command. Changes characters in a given set to another set.
• q: Quit command. Exits SED after processing a specific line.
• =: Displays the current line number.
Basic SED syntax

● The modified stream of characters can be redirected from the screen to a file for later use
● One generally redirects the output from the screen to a new file
● sed 's/pattern to find/pattern to replace it with/' input_file > output_file
● E.g., sed 's/chr/Chromosome/' practical/Notebooks/awk/genes.gff > sed_output_genes.gff
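Redirection can be sketched as follows, again assuming a hypothetical two-line sample file rather than the course's genes.gff; the file names are made up for illustration.

```shell
# Hypothetical two-line, tab-delimited input file.
tmpdir=$(mktemp -d)
printf 'chr1\tgene\nchr2\tgene\n' > "$tmpdir/sample.gff"

# Send sed's output to a new file instead of the screen.
sed 's/chr/Chromosome/' "$tmpdir/sample.gff" > "$tmpdir/sed_output_sample.gff"

# Compare the new file with the untouched original.
new_file=$(cat "$tmpdir/sed_output_sample.gff")
old_file=$(cat "$tmpdir/sample.gff")
printf '%s\n' "$new_file"

rm -rf "$tmpdir"
```

The output file should always have a different name from the input file: with `> input_file` the shell empties the file before sed gets to read it.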

Basic SED syntax
● Say I need to convert the genes.gff file from its current tab-delimited format to a comma-separated one for another program
● sed 's/\t/,/' practical/Notebooks/awk/genes.gff
chr1,source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
chr1,source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein
chr1,source5 repeat 10000 14000 1 + . name=ALU

Basic SED syntax

● Only the first tab was substituted by a comma, and not the rest of the tabs

● Why?

Basic SED syntax

● Recall that sed reads the file line by line and, by default, acts on the first match of the pattern it finds in each line
● For the sed 's/chr/Chromosome/' example, chr appears only once on each line
● For sed 's/\t/,/', the tab character appears multiple times on each line
● sed's default behaviour is to substitute only the first match on each line
Let's see some practical examples of using SED:

● sed 's/old_text/new_text/' input.txt (replaces the first match of old_text on each line)
● sed '/pattern/d' input.txt (deletes every line that matches pattern)
● sed -n '5,10p' input.txt (prints only lines 5 to 10)
● sed -f script.sed filename (runs the commands in a script file)
● sed -e 's/old/new/g' -e '/pattern/d' filename (combines several commands)
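The -e and -f forms above can be compared side by side. This is a minimal sketch with a hypothetical input file and script file (input.txt and script.sed are created in a temporary directory for the purpose).

```shell
# Hypothetical input with the word "old" and one line to delete.
tmpdir=$(mktemp -d)
printf 'old one old\nremove me\nkeep old\n' > "$tmpdir/input.txt"

# Two commands given on the command line with -e.
e_out=$(sed -e 's/old/new/g' -e '/remove/d' "$tmpdir/input.txt")

# The same two commands stored one per line in a script file, run with -f.
printf 's/old/new/g\n/remove/d\n' > "$tmpdir/script.sed"
f_out=$(sed -f "$tmpdir/script.sed" "$tmpdir/input.txt")

printf '%s\n' "$e_out"
rm -rf "$tmpdir"
```

Both invocations produce the same output; a script file is mainly convenient when the list of commands gets long or is reused.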

Basic SED syntax

● What if we want to replace all the matches to the pattern regardless of the
number of times it appears in a new line?
● Use the global flag
● E.g., sed 's/\t/,/g' practical/Notebooks/awk/genes.gff
chr1,source1,gene,100,300,0.5,+,0,name=gene1;product=unknown
chr1,source2,gene,1000,1100,0.9,-,0,name=recA;product=RecA protein
chr1,source5,repeat,10000,14000,1,+,.,name=ALU
chr2,source2,gene,10000,1200,0.95,+,0
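The difference the global flag makes can be checked on a single line. A sketch assuming GNU sed (where \t in the pattern means a tab) and a hypothetical one-line sample file:

```shell
# Hypothetical single record with three tabs.
tmpdir=$(mktemp -d)
printf 'chr1\tsource1\tgene\t100\n' > "$tmpdir/sample.gff"

# Without g: only the first tab on the line becomes a comma.
first_only=$(sed 's/\t/,/' "$tmpdir/sample.gff")
# With g: every tab on the line becomes a comma.
all_tabs=$(sed 's/\t/,/g' "$tmpdir/sample.gff")

printf '%s\n%s\n' "$first_only" "$all_tabs"
rm -rf "$tmpdir"
```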

Counting and extracting lines with SED

● One can use sed to print out specific lines in a file, e.g., a row
● Say I wanted to extract only lines 1, 2 and 3 from the genes.gff file
● sed '1,3p' practical/Notebooks/awk/genes.gff
● Notice that the s command and its slashes are not present, as we are just extracting lines, not matching and modifying any characters
Counting and extracting lines with SED
sed '1,3p' practical/Notebooks/awk/genes.gff
chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein
chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein
chr1 source5 repeat 10000 14000 1 + . name=ALU
chr1 source5 repeat 10000 14000 1 + . name=ALU
chr2 source2 gene 10000 1200 0.95 + 0
chr2 source1 gene 50 900 0.4 - 0 name=gene2;product=gene2 protein
chr3 source1 gene 200 210 0.8 . 0 name=gene3
chr4 source3 repeat 300 400 1 + . name=ALU
chr10 source2 repeat 60 70 0.78 + . name=LINE1
chr10 source2 repeat 150 166 0.84 + . name=LINE2
chrX source1 gene 123 456 0.6 + 0 name=gene4;product=unknown

Counting and extracting lines with SED

● In this case sed printed out the whole file, with lines 1 to 3 each appearing twice
● Why?

Counting and extracting lines with SED
● Recall that sed's default behaviour is to print every line to the screen, so the p command printed lines 1 to 3 a second time
● We can use the -n option to suppress this default printing
● E.g., sed -n '1,3p' practical/Notebooks/awk/genes.gff
chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein
chr1 source5 repeat 10000 14000 1 + . name=ALU
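The effect of -n is easy to see on a file whose lines are numbered. A minimal sketch with a hypothetical five-line file:

```shell
# Hypothetical five-line input.
tmpdir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$tmpdir/lines.txt"

# Without -n: every line is printed once by default, and lines 1-3 again by p.
with_dupes=$(sed '1,3p' "$tmpdir/lines.txt")
# With -n: default printing is off, so only the lines selected by p appear.
only_range=$(sed -n '1,3p' "$tmpdir/lines.txt")

printf '%s\n---\n%s\n' "$with_dupes" "$only_range"
rm -rf "$tmpdir"
```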

Counting and extracting lines with SED
● sed -n '1,3p' practical/Notebooks/awk/genes.gff prints out the range of lines from 1 to 3
● How do I get it to print out specific lines, e.g., 1-3 and then 5 and 7?
● sed -n '1,3p; 5p; 7p' practical/Notebooks/awk/genes.gff
chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein
chr1 source5 repeat 10000 14000 1 + . name=ALU
chr2 source1 gene 50 900 0.4 - 0 name=gene2;product=gene2 protein
chr4 source3 repeat 300 400 1 + . name=ALU
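Selecting a range plus individual lines can be sketched on a hypothetical seven-line file, where each line carries its own number so the selection is easy to check:

```shell
# Hypothetical seven-line input.
tmpdir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\nline6\nline7\n' > "$tmpdir/lines.txt"

# Print the range 1-3, then line 5 and line 7; -n suppresses everything else.
picked=$(sed -n '1,3p; 5p; 7p' "$tmpdir/lines.txt")

printf '%s\n' "$picked"
rm -rf "$tmpdir"
```

The selected lines come out in the order they occur in the file, since sed applies the whole script to each line in turn.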

Special characters for SED

● I mainly use sed for pattern matching, substitution and formatting of files
● sed provides a number of useful special characters that give more control over its pattern matching:
● ^ matches the start of the line
● $ matches the end of the line
● [a-z] matches any lowercase letter and [A-Z] any uppercase letter; combined with \U& (upper case) or \L& (lower case) in the replacement, they can be used to change case (note: \U and \L are GNU sed extensions; the Unix command tr can also be used for case changing)
● sed -n 'p;n' prints the odd-numbered lines
● sed -n 'n;p' prints the even-numbered lines
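The two odd/even one-liners rely on the n command, which moves on to the next input line. A minimal sketch on a hypothetical five-line file, with the behaviour at the end of the file assumed to be that of GNU sed:

```shell
# Hypothetical five-line input.
tmpdir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$tmpdir/lines.txt"

# p prints the current line, then n skips over the following one.
odd_lines=$(sed -n 'p;n' "$tmpdir/lines.txt")
# n moves to the following line first, then p prints it.
even_lines=$(sed -n 'n;p' "$tmpdir/lines.txt")

printf '%s\n---\n%s\n' "$odd_lines" "$even_lines"
rm -rf "$tmpdir"
```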

Special characters for SED
● sed 's/^/Organism_/g' genes.gff
Organism_chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
Organism_chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein
● sed 's/$/_Organism/g' genes.gff
chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown_Organism
chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein_Organism
● sed 's/[a-z]/\U&/g' genes.gff
CHR1 SOURCE1 GENE 100 300 0.5 + 0 NAME=GENE1;PRODUCT=UNKNOWN
CHR1 SOURCE2 GENE 1000 1100 0.9 - 0 NAME=RECA;PRODUCT=RECA PROTEIN
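The three substitutions above can be reproduced on a single record. A sketch assuming GNU sed (needed for \U) and a hypothetical one-line, tab-delimited sample file:

```shell
# Hypothetical single record: a sequence name and one attribute.
tmpdir=$(mktemp -d)
printf 'chr1\tname=recA\n' > "$tmpdir/sample.gff"

# ^ anchors the match at the start of the line: add a prefix.
prefixed=$(sed 's/^/Organism_/' "$tmpdir/sample.gff")
# $ anchors the match at the end of the line: add a suffix.
suffixed=$(sed 's/$/_Organism/' "$tmpdir/sample.gff")
# & is the matched text; \U (GNU extension) upper-cases it.
upper=$(sed 's/[a-z]/\U&/g' "$tmpdir/sample.gff")

printf '%s\n%s\n%s\n' "$prefixed" "$suffixed" "$upper"
rm -rf "$tmpdir"
```

Since a line has only one start and one end, the g flag makes no difference for the ^ and $ substitutions; it is essential for the case change, where every lowercase letter has to be matched.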

Special characters for SED
● For more specific control, include the text to be matched in the pattern, e.g.,
● sed 's/^chr/Organism_&/' genes.gff (here & in the replacement stands for the matched text, so chr is kept)
Organism_chr1 source1 gene 100 300 0.5 + 0 name=gene1;product=unknown
Organism_chr1 source2 gene 1000 1100 0.9 - 0 name=recA;product=RecA protein
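One detail worth checking when a pattern such as ^chr is combined with a prefix: in a regular expression, * means "zero or more of the preceding character", and whatever the pattern matches is replaced. The sketch below, on a hypothetical two-line file, compares a pattern ending in * with one that reuses the match through &.

```shell
# Hypothetical input: one line starting with chr, one that does not.
tmpdir=$(mktemp -d)
printf 'chr1\tgene\nscaffold7\tgene\n' > "$tmpdir/sample.gff"

# ^chr* matches "ch" plus any number of r's; the matched "chr" is replaced.
star_out=$(sed 's/^chr*/Organism_/' "$tmpdir/sample.gff")
# ^chr matches the literal text; & puts it back after the prefix.
amp_out=$(sed 's/^chr/Organism_&/' "$tmpdir/sample.gff")

printf '%s\n---\n%s\n' "$star_out" "$amp_out"
rm -rf "$tmpdir"
```

In both cases the line that does not start with chr is left untouched, which is the extra control gained by naming the text in the pattern.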

More info and examples on using SED (syntax and usage might differ)
● https://bioinformaticsworkbook.org/Appendix/Unix/unix-basics4sed.html#gsc.tab=0
● https://dasher.wustl.edu/chem478/software/unix-tools/sed.html
● https://www.grymoire.com/Unix/Sed.html
● https://gist.github.com/ssstonebraker/6140154
