Introduction to Bioinformatics “CSCI-471”
Review of the previous lab:
1- Determine your current path: use print working directory (pwd)
2- Change your path: use change directory (cd)
3- Move to Documents and make 2 new folders (lab and lecture): use mkdir lab lecture
4- Move to the lab folder and make a text file called tutorial.txt to hold some sequences:
- cd lab
- cat > tutorial.txt (creates the file and writes what you type into it; press Ctrl+D on an empty line to finish)
AAAAAACCTGG
GGTCACTGGTA
- cat tutorial.txt (to show its contents)
- cat >> tutorial.txt (to append more data to the file; press Ctrl+D on an empty line to finish)
ACGTGGGCCGT
- cat tutorial.txt (to show all its contents)
AAAAAACCTGG
GGTCACTGGTA
ACGTGGGCCGT
5- Move to the lecture folder (by relative path): cd ../lecture/
6- Make 2 empty files inside the lecture folder: touch tutorial2.txt tutorial3.docx
7- To list the contents of the lecture folder: use ls with its options (see man ls)
8- Return to the lab folder (by absolute path; ~ expands to your home directory): cd ~/Documents/lab
9- To get details about any command: man ls, ls --help, or search online
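The review steps above can be replayed non-interactively. The sketch below is only an illustration: it assumes a throwaway directory (the hypothetical variable workdir) in place of ~/Documents, and it uses printf instead of typing into cat so that it runs without keyboard input.

```shell
#!/usr/bin/env bash
# Replay of the review steps in a scratch directory.
# Assumption: "workdir" is a hypothetical stand-in for ~/Documents.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
pwd                                   # step 1: print the current path

mkdir lab lecture                     # step 3: two folders in one command
cd lab                                # step 4: enter the lab folder

# printf stands in for the interactive "cat > file" and "cat >> file"
printf 'AAAAAACCTGG\nGGTCACTGGTA\n' > tutorial.txt   # > creates or overwrites
printf 'ACGTGGGCCGT\n' >> tutorial.txt               # >> appends
cat tutorial.txt                      # show the three sequences

cd ../lecture                         # step 5: relative path
touch tutorial2.txt tutorial3.docx    # step 6: two empty files
ls -l                                 # step 7: list the folder contents

cd "$workdir/lab"                     # step 8: absolute path
n_lines=$(wc -l < tutorial.txt | tr -d ' ')
echo "tutorial.txt has $n_lines lines"
```

Using > a second time instead of >> would overwrite the first two sequences, which is why the append step uses >>.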
This tutorial: case study “converting FASTQ to FASTA”
# Download the data:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000001/ERR000001_1.fastq.gz
If wget is not installed, install it with
sudo yum install wget (on Red Hat based systems) or sudo apt-get install wget (on Debian based systems)
# To list the files with their sizes: ls -lh (the compressed file is about 30M)
# File compression and decompression:
gunzip ERR000001_1.fastq.gz
# To list the files with their sizes again: ls -lh (the decompressed file is about 130M)
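The compress, inspect, decompress cycle can be tried on a tiny file first. This is a minimal sketch under stated assumptions: toy.fastq and its two reads are made up for illustration, and gzip -dc is used to peek inside the compressed file without decompressing it on disk.

```shell
#!/usr/bin/env bash
# Toy compression round trip. The file name and reads are invented.
set -eu
cd "$(mktemp -d)"

printf '%s\n' \
  '@read1 demo' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2 demo' 'CCCCCTTAAAAA' '+' 'IIIIIIIIIIII' > toy.fastq

gzip toy.fastq                 # writes toy.fastq.gz and removes toy.fastq
ls -lh toy.fastq.gz            # size of the compressed file

# -d decompresses, -c writes to standard output and leaves the .gz file untouched
first_header=$(gzip -dc toy.fastq.gz | head -n 1)
echo "first header: $first_header"

gunzip toy.fastq.gz            # restores toy.fastq and removes toy.fastq.gz
ls -lh toy.fastq               # size of the decompressed file
```

On a file this small the compressed copy may not be smaller than the original; the size difference only becomes visible on real data.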
# Display the Contents of a File
cat ERR000001_1.fastq
more ERR000001_1.fastq
less ERR000001_1.fastq
head ERR000001_1.fastq
tail ERR000001_1.fastq
# Count the number of lines
wc -l ERR000001_1.fastq
# Search for a pattern (prints only the matching sequence lines, not the read they belong to)
grep "CCCCCTTAAAAA" ERR000001_1.fastq
# Combine multiple commands (count the lines that contain the pattern)
grep "CCCCCTTAAAAA" ERR000001_1.fastq | wc -l
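The counting commands above can be checked on a toy file where the answer is known. The sketch assumes a made-up file toy.fastq with exactly four lines per read (header, sequence, +, quality), so the read count is the line count divided by 4. It also shows that grep -c gives the same number as grep piped into wc -l.

```shell
#!/usr/bin/env bash
# Counting lines, reads and pattern hits on an invented 3-read FASTQ file.
set -eu
cd "$(mktemp -d)"

printf '%s\n' \
  '@read1 demo' 'ACGTCCCCCTTAAAAAGT' '+' 'IIIIIIIIIIIIIIIIII' \
  '@read2 demo' 'ACGTNCGTAC' '+' 'IIIIIIIIII' \
  '@read3 demo' 'CCCCCTTAAAAA' '+' 'IIIIIIIIIIII' > toy.fastq

n_lines=$(wc -l < toy.fastq | tr -d ' ')    # tr strips padding some wc versions add
n_reads=$(( n_lines / 4 ))                  # assumes 4 lines per read

n_hits=$(grep "CCCCCTTAAAAA" toy.fastq | wc -l | tr -d ' ')
n_hits_c=$(grep -c "CCCCCTTAAAAA" toy.fastq)   # -c counts matching lines directly

echo "lines=$n_lines reads=$n_reads hits=$n_hits hits_with_c=$n_hits_c"
```

Both counts are numbers of matching lines, not numbers of occurrences: a line that contains the pattern twice is still counted once.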
#Converting a FASTQ File into a Tabular Format
cat ERR000001_1.fastq | paste - - - - > ERR_tab.txt
# To see the difference between the FASTQ file and its tabular version: run head on both files
# Search for the pattern again (each match now shows the whole read: header, sequence and quality on one line)
grep "CCCCCTTAAAAA" ERR_tab.txt
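What paste - - - - does is easier to see on a few reads. In this sketch (toy.fastq and its reads are invented for illustration) each dash stands for standard input, so paste takes four consecutive lines and joins them with tabs into one row. A grep on the tabular file then returns the read header together with the matching sequence.

```shell
#!/usr/bin/env bash
# Four FASTQ lines become one tab-separated row. Toy data, invented here.
set -eu
cd "$(mktemp -d)"

printf '%s\n' \
  '@read1 demo' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2 demo' 'ACGTNCGTAC' '+' 'IIIIIIIIII' \
  '@read3 demo' 'CCCCCTTAAAAA' '+' 'IIIIIIIIIIII' > toy.fastq

paste - - - - < toy.fastq > toy_tab.txt     # one row per read
cat toy_tab.txt

n_rows=$(wc -l < toy_tab.txt | tr -d ' ')

# cut -f 1 keeps the first tab-separated column, i.e. the full header
hit_header=$(grep "CCCCCTTAAAAA" toy_tab.txt | cut -f 1)
echo "rows=$n_rows, pattern found in: $hit_header"
```

Reading the file with < instead of cat file | gives the same result with one process fewer.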
# Pattern matching using awk
Its format: awk '/pattern to search/ {actions}' filename (used this way, awk behaves like grep)
awk '/CCCCCTTAAAAA/ {print $0}' ERR_tab.txt ($0 is the whole record, so the full read is printed)
# To print the first and third fields (header and sequence)
awk '/CCCCCTTAAAAA/ {print $1 "\t" $3}' ERR_tab.txt
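Why the sequence is field $3 and not $2 depends on how awk splits a row. The sketch below assumes a made-up record whose header contains one space, which is the situation the commands above imply. By default awk splits on any whitespace, so the header takes up two fields; with -F'\t' it splits on tabs only and the sequence becomes the second field.

```shell
#!/usr/bin/env bash
# Default whitespace splitting versus tab-only splitting in awk.
# The record is invented; its header "@read3 demo" contains one space.
set -eu

row=$(printf '@read3 demo\tCCCCCTTAAAAA\t+\tIIIIIIIIIIII')

# Default: spaces and tabs both separate fields
nf_default=$(printf '%s\n' "$row" | awk '{print NF}')
seq_default=$(printf '%s\n' "$row" | awk '{print $3}')

# -F'\t': only tabs separate fields, so the header stays in one piece
nf_tab=$(printf '%s\n' "$row" | awk -F'\t' '{print NF}')
seq_tab=$(printf '%s\n' "$row" | awk -F'\t' '{print $2}')

echo "default split: NF=$nf_default sequence=$seq_default"
echo "tab split:     NF=$nf_tab sequence=$seq_tab"
```

If the headers in a file contain no space, or more than one, the field numbers under default splitting shift accordingly, so it is worth checking with head before choosing a field number.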
# To print the sequence and quality score: try it yourself (assignment)
# To determine which sequences contain N
awk '{if($3~"N") print $1 "\t" $3}' ERR_tab.txt
# To determine how many sequences contain N: try it yourself (assignment)
# Sort and extract unique sequences
cat ERR_tab.txt | sort -k 3 > ERR_sorted.txt (-k selects the sort key; here the key starts at the third field, which is the sequence)
# To get the unique sequences
cat ERR_tab.txt | sort -k 3 -u > ERR_unique.txt
# To compare the sorted and unique files: run wc -l on both
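The effect of -u can be checked on a small table. The sketch below uses invented rows and sets LC_ALL=C as an assumption, to get the same byte-wise ordering on every machine. It also illustrates one detail of the key syntax: -k 3 means from field 3 to the end of the line, so the quality string takes part in the comparison, while -k 3,3 restricts the key to the sequence field alone.

```shell
#!/usr/bin/env bash
# sort -k and -u on an invented 4-row table (header, sequence, +, quality).
set -eu
export LC_ALL=C                      # assumption: byte-wise ordering for reproducibility
cd "$(mktemp -d)"

printf '@read1 demo\tGGGG\t+\tIIII\n'  > toy_tab.txt
printf '@read2 demo\tAAAA\t+\tIIII\n' >> toy_tab.txt
printf '@read3 demo\tGGGG\t+\tIIII\n' >> toy_tab.txt   # same sequence and quality as read1
printf '@read4 demo\tAAAA\t+\tHHHH\n' >> toy_tab.txt   # same sequence as read2, other quality

sort -k 3 toy_tab.txt > toy_sorted.txt            # sorted, nothing removed
sort -k 3 -u toy_tab.txt > toy_unique.txt         # key = field 3 to end of line
sort -k 3,3 -u toy_tab.txt > toy_unique_seq.txt   # key = field 3 only

n_sorted=$(wc -l < toy_sorted.txt | tr -d ' ')
n_unique=$(wc -l < toy_unique.txt | tr -d ' ')
n_unique_seq=$(wc -l < toy_unique_seq.txt | tr -d ' ')
echo "sorted=$n_sorted unique(-k 3)=$n_unique unique(-k 3,3)=$n_unique_seq"
```

Here read4 survives -k 3 -u because its quality string differs, but it is dropped by -k 3,3 -u, which compares sequences only.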
# Convert Reads into FASTA Format Sequences
awk '{print $1 "\t" $3}' ERR_tab.txt > ERR_allseqs.txt
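sed -i 's/^@/>/' ERR_allseqs.txt (-i edits the file in place; without it sed only prints the result to the screen and the file keeps its @ headers)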
head ERR_allseqs.txt
awk '{print $1 "\n" $2}' ERR_allseqs.txt > ERR_allseqs.fasta
head ERR_allseqs.fasta
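The conversion steps above can also be chained into one pipeline, which avoids the intermediate files. This is a sketch on invented data (toy.fastq with two reads whose headers contain one space), not the only way to do the conversion.

```shell
#!/usr/bin/env bash
# FASTQ to FASTA in one pipeline, following the same steps as above.
# Toy data: file name and reads are invented for illustration.
set -eu
cd "$(mktemp -d)"

printf '%s\n' \
  '@read1 demo' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2 demo' 'CCCCCTTAAAAA' '+' 'IIIIIIIIIIII' > toy.fastq

paste - - - - < toy.fastq \
  | awk '{print $1 "\t" $3}' \
  | sed 's/^@/>/' \
  | awk '{print $1 "\n" $2}' > toy.fasta
# paste: one row per read; first awk: keep read ID and sequence;
# sed: FASTA headers start with >; second awk: header and sequence on separate lines

cat toy.fasta

n_headers=$(grep -c '^>' toy.fasta)
first_line=$(head -n 1 toy.fasta)
echo "FASTA records: $n_headers"
```

Because sed runs after the quality column has been dropped, a quality string that happens to start with @ cannot be mistaken for a header.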
References:
- http://www.yourownlinux.com/2014/01/linux-ls-command-tutorial-with-examples.html
- https://www.computerhope.com/unix/uls.htm
- https://www.computerhope.com/unix/ucd.htm
- http://kirste.userpage.fu-berlin.de/chemnet/use/info/gawk/gawk_3.html
- http://www.theunixschool.com/2012/08/linux-sort-command-examples.html
- https://www.computerhope.com/unix/used.htm
- Second chapter of Bioinformatics: A Practical Handbook of Next Generation Sequencing and Its Applications