Sysad313: Text Processing Filters in Linux: Objectives
Objectives:
  o Introduce the concept of filters, pipes and redirection
  o Demonstrate text manipulation using filters
  o Learn how to display text, sort, count words and lines, and translate characters
Introduction

A Linux command often displays more information than the user needs, which makes its output harder to interpret. Using the filtering utilities available in Linux, you can reshape the output of a command into the format or structure that best suits your needs.

Text Filtering

Text filtering is the process of transforming text from an input stream before sending it to an output stream. Input and output can be files, but most of the time filtering is done through a pipeline of commands, where the output of one command is piped or redirected to become the input of the next.

Stream

A stream is a sequence of bytes that is read or written using library functions that hide the details of the underlying device from the application. Using streams, programs can read from and write to a terminal, file, or network socket in a device-independent way.

Modern shells use three standard I/O streams:
  o stdin is the standard input stream, which provides input to commands
  o stdout is the standard output stream, which carries normal output from commands
  o stderr is the standard error stream, which carries error output from commands
File descriptors for stdin, stdout and stderr:
  o Standard input (stdin) - 0
  o Standard output (stdout) - 1
  o Standard error (stderr) - 2
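The three streams can be seen together in a small filter written in the shell itself. This is only an illustrative sketch: shout is a hypothetical function name, not a standard utility. It reads lines from stdin (descriptor 0), writes its result to stdout (descriptor 1), and sends a status message to stderr (descriptor 2). The >&2 and 2>/dev/null forms are redirection operators, which these notes cover in the redirection sections.

```shell
# shout: hypothetical example filter, not a standard command.
# Reads lines from stdin (fd 0), writes each line followed by "!" to
# stdout (fd 1), and reports how many lines it saw on stderr (fd 2).
shout() {
    count=0
    while IFS= read -r line; do
        printf '%s!\n' "$line"
        count=$((count + 1))
    done
    printf 'processed %d lines\n' "$count" >&2
}

# Both streams normally appear on the terminal.
printf 'apple\npear\n' | shout

# Discarding stderr leaves only the filtered text.
filtered=$(printf 'apple\npear\n' | shout 2>/dev/null)
printf '%s\n' "$filtered"
```

Because the status message travels on a separate stream, it never mixes with the data that the next command in a pipeline would read.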
Piping with |

Commands take input from the parameters given by the user and display their output on the terminal. A filter can take its input either from the standard input stream or from a file. To use the output of one command (e.g. command1) as the input to a filter (command2), connect the two commands with the pipe operator ( | ).

Example:
    $ echo -e "apple\npear\nbanana" | sort
    apple
    banana
    pear
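The way echo handles -e and backslash escapes differs between shells, so the same pipeline is sketched here with printf, which treats \n consistently. The variable name sorted is chosen only for illustration.

```shell
# printf interprets \n the same way in every POSIX shell.
printf 'pear\napple\nbanana\n' | sort

# The same pipeline with its output captured in a variable;
# $(...) strips the trailing newline.
sorted=$(printf 'pear\napple\nbanana\n' | sort)
printf '%s\n' "$sorted"
```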
You can also use | to send the output of the second command in the pipeline to a third command, and so on. You will also see a hyphen ( - ) used in place of a filename as an argument to a command, meaning that the input should come from stdin rather than from a file.

Output Redirection with >

Sometimes you want to save the output of a command to a file. You do this with the output redirection operator ( > ).

Example:
    $ mkdir practice
    $ cd practice
    $ echo -e "1 apple\n2 pear\nbanana" > text.txt
    $ less text.txt

Input Redirection with <

Input redirection is generally used when you already have a file and want to run a command on its contents. Input redirection does not work with every command: only commands that read their input from the keyboard (stdin) can be redirected to read from a file instead.

Error Redirection

Many commands produce a lot of error messages that you may not care about.
  o For example, whenever you search for a file, you often get many "Permission denied" error messages
  o The simplest way to deal with them is to redirect them elsewhere

Example:
    $ find . -name '*.c' 2> errorfile

Example: Redirecting both standard error and stdout to a file
    $ command &> file
    $ command > file 2>&1
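The redirection operators above can be tried safely in a throwaway directory. The sketch below is not part of the original notes: the file names (fruit.txt, errors.txt, both.txt, no-such-file) and variable names are invented for illustration, and ls is used only because it conveniently writes to both stdout and stderr when one argument does not exist.

```shell
# Work in a temporary directory so no existing files are touched.
workdir=$(mktemp -d)

# > sends stdout to a file, overwriting any previous contents.
printf '2 pear\n1 apple\n3 banana\n' > "$workdir/fruit.txt"

# < feeds a file to a command's stdin.
sorted=$(sort < "$workdir/fruit.txt")

# "-" tells cat to read stdin at that position in its argument list.
combined=$(printf 'extra line\n' | cat "$workdir/fruit.txt" -)

# 2> captures only stderr; stdout is still captured by $(...).
# ls exits non-zero here, so "|| true" keeps a "set -e" script going.
listing=$(ls "$workdir/fruit.txt" "$workdir/no-such-file" 2> "$workdir/errors.txt") || true

# "> file 2>&1" sends both streams to one file
# (bash also accepts the shorthand: command &> file).
ls "$workdir/fruit.txt" "$workdir/no-such-file" > "$workdir/both.txt" 2>&1 || true

errors=$(cat "$workdir/errors.txt")
both=$(cat "$workdir/both.txt")
printf '%s\n' "$sorted" "$combined" "$errors" "$both"

rm -rf "$workdir"
```

The order matters in the last form: 2>&1 must come after > file, because it copies whatever stdout points to at that moment.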
Counting Words and Lines with wc

wc counts characters, words and lines in a file. If used with multiple files, it outputs counts for each file and a combined total.
Options:
  o -c: output byte count (the same as the character count for plain ASCII text)
  o -l: output line count
  o -w: output word count
  o Default: -clw (all three counts)
Example: Display the word count for novel.txt
    $ wc -w novel.txt
Example: Display the total number of lines in several text files
    $ wc -l *.txt

Sorting Lines of Text with sort

The sort filter reads lines of text and prints them in sorted order.
Options:
  o -f makes sorting case-insensitive
  o -n sorts numerically, rather than lexicographically
Example:
    $ sort words > sorted-words

Removing Duplicate Lines with uniq

Use uniq to find unique lines in a file.
  o Removes consecutive duplicate lines
  o Usually given sorted input, so that all duplicates are removed
Example: Find out how many unique words are in the dictionary
    $ sort /usr/dict/words | uniq | wc -w

Selecting Parts of Lines with cut

Used to select columns or fields from each line of input.
Select a range of:
  o Characters, with -c
  o Fields, with -f
The field separator is specified with -d (defaults to tab).
A range is written as a start and end position, e.g. 3-5
  o Either can be omitted
  o The first character or field is numbered 1, not 0
Example: Select the usernames of logged-in users
    $ who | cut -d' ' -f1 | sort -u

Expanding Tabs to Spaces with expand

Used to replace tabs with spaces in files.
Tab size (the maximum number of spaces for each tab) can be set with -t number
  o Default tab size is 8
To change only the tabs at the beginning of lines, use -i
Example: Expand all tabs in example.txt using a tab size of three, and display the result on screen
    $ expand -t 3 example.txt
    $ expand -3 example.txt

Using fmt to Format Text Files

Arranges words nicely into lines of consistent length.
Options:
  o -u: convert to uniform spacing (one space between words, two between sentences)
  o -w width: set the maximum line width in characters (defaults to 75)
Example: Change the line length of notes.txt to a maximum of 70 characters, and display it on the screen
    $ fmt -w 70 notes.txt | less

Reading the Start of a File with head

Prints the top of its input and discards the rest.
Set the number of lines to print with -n lines or -lines
  o Defaults to ten lines
Example: View the headers of an HTML document called homepage.html
    $ head homepage.html
Example: Print the first line of a text file
    $ head -n 1 notes.txt
    $ head -1 notes.txt

Reading the End of a File with tail

Similar to head, but prints lines at the end of a file.
The -f option watches the file forever
  o Continually updates the display as new entries are appended to the end of the file
  o Terminate it with Ctrl+C
The -n option is the same as in head (number of lines to print)
Example: Monitor HTTP requests on a web server
    $ tail -f /var/log/httpd/access_log
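The filters introduced so far become most useful when chained together. The following sketch counts which login shell is most common in a small set of records. The data is invented for illustration, the variable names are hypothetical, and uniq -c (prefix each line with its number of occurrences) is an option not listed in the notes above.

```shell
# Hypothetical sample data: "user:shell" records, invented for illustration.
records='alice:/bin/bash
bob:/bin/sh
carol:/bin/bash
dave:/bin/zsh
erin:/bin/bash
frank:/bin/sh'

# cut -d: -f2 keeps the second colon-separated field (the shell).
# sort groups identical lines together so that uniq -c can count them.
# sort -rn puts the highest count first; head -n 1 keeps only that line.
most_common=$(printf '%s\n' "$records" | cut -d: -f2 | sort | uniq -c | sort -rn | head -n 1)

# sort -u removes duplicates in one step; wc -l counts what is left.
distinct=$(printf '%s\n' "$records" | cut -d: -f2 | sort -u | wc -l)

# tail -n 1 selects the last record; cut -d: -f1 keeps the user name.
last_user=$(printf '%s\n' "$records" | tail -n 1 | cut -d: -f1)

printf 'most common: %s\n' "$most_common"
printf 'distinct shells: %s\n' "$distinct"
printf 'last user: %s\n' "$last_user"
```

Sorting before uniq is essential here, because uniq only merges duplicate lines that are adjacent.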
Numbering Lines of a File with nl or cat

Displays the input with a line number against each line. There are options to finely control the formatting.
By default, blank lines aren't numbered
  o The option -ba numbers every line
  o cat -n also numbers lines, including blank ones
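The difference between the three forms shows up as soon as the input contains a blank line. A minimal sketch, using a temporary file and variable names chosen only for this example:

```shell
# Three lines of input, the middle one blank.
sample=$(mktemp)
printf 'first\n\nthird\n' > "$sample"

nl "$sample"        # numbers only the non-blank lines
nl -ba "$sample"    # -ba numbers every line
cat -n "$sample"    # also numbers every line

# Count how many output lines received a number in each case.
nl_default=$(nl "$sample" | grep -c '^ *[0-9]')
nl_all=$(nl -ba "$sample" | grep -c '^ *[0-9]')
cat_all=$(cat -n "$sample" | grep -c '^ *[0-9]')
printf 'nl: %s  nl -ba: %s  cat -n: %s\n' "$nl_default" "$nl_all" "$cat_all"

rm -f "$sample"
```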
Dumping Bytes of Binary Data with od

Prints the numeric values of the bytes in a file. Useful for studying files with non-text characters.
By default, prints two-byte words in octal.
Specify an alternative with the -t option
  o o: octal
  o x: hexadecimal
  o u: unsigned decimal
  o Can be followed by the number of bytes per word
  o Add z to show ASCII equivalents alongside the numbers
  o A useful format is given by od -t x1z: hexadecimal, one-byte words, with ASCII

Paginating Text Files with pr

Converts a text file into paginated text, with headers and page fills.
Options:
  o -d: double-spaced output
  o -h header: change from the default header to header
  o -l lines: change the default number of lines on a page from 66 to lines
  o -o width: set (offset) the left margin to width
Example:
    $ pr -h "My Thesis" thesis.txt | lpr

Dividing Files into Chunks with split

Splits files into equal-sized segments.
Syntax: split [options] [input] [output-prefix]
Use:
  o -l n: split a file into n-line chunks
  o -b n: split into chunks of n bytes each
Output files are named using the specified output prefix with aa, ab, ac, etc. added to the end.
Example: Split essay.txt into 30-line files, and save the output to files short_aa, short_ab, etc.:
    $ split -l 30 essay.txt short_

Reversing Files with tac

Similar to cat, but in reverse. Prints the last line of the input first, the penultimate line second, and so on.
Example: Show a list of logins and logouts, but with the most recent events at the end
    $ last | tac

Translating Sets of Characters with tr

Translates one set of characters to another.
Syntax: tr start-set end-set
Replaces all characters in start-set with the corresponding characters in end-set.
Cannot accept a file as an argument; it uses only the standard input and output.
Options:
  o -d: deletes characters in start-set instead of translating them
  o -s: replaces sequences of identical characters with just one
Example: Replace all uppercase characters in input-file with lowercase characters
    $ cat input-file | tr A-Z a-z
    $ tr A-Z a-z < input-file
Example: Delete all occurrences of z in story.txt
    $ cat story.txt | tr -d z
Example: Squeeze each sequence of repeated f characters in lullaby.txt into just one f
    $ tr -s f < lullaby.txt

Put Files Side by Side with paste

paste takes lines from two or more files and puts them in columns of the output.
Use -d char to set the delimiter between fields in the output
  o The default is tab
  o Giving -d more than one character sets different delimiters between each pair of columns
Example: Assign passwords to users, separating them with a colon
    $ paste -d : usernames passwords > .htpasswd

Performing Database Joins with join

Does a database-style inner join on two tables stored in text files.
The -t option sets the field delimiter
  o By default, fields are separated by any number of spaces or tabs
The input files must be sorted on the join field.
Example: Show details of suppliers and their products
    $ join suppliers.txt products.txt | less
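The tr, paste and join examples above can be tried on small inputs to see exactly what each one produces. In this sketch all file contents (user names, passwords, supplier and product names) are invented for illustration, and the files live in a temporary directory that is removed at the end.

```shell
workdir=$(mktemp -d)

# tr reads only from stdin: translate, delete, and squeeze.
lower=$(printf 'Hello World\n' | tr 'A-Z' 'a-z')
no_z=$(printf 'zebra zone\n' | tr -d 'z')
squeezed=$(printf 'fluffy  stufff\n' | tr -s 'f')

# paste joins line N of each file, here with a colon between them.
printf 'alice\nbob\n' > "$workdir/usernames"
printf 'secret1\nsecret2\n' > "$workdir/passwords"
pasted=$(paste -d : "$workdir/usernames" "$workdir/passwords")

# join matches lines whose first field is equal in both files;
# both inputs are already sorted on that field.
printf '1 Acme\n2 Globex\n' > "$workdir/suppliers.txt"
printf '1 bolts\n2 cables\n' > "$workdir/products.txt"
joined=$(join "$workdir/suppliers.txt" "$workdir/products.txt")

printf '%s\n' "$lower" "$no_z" "$squeezed" "$pasted" "$joined"

rm -rf "$workdir"
```

Note that tr -s f squeezes only runs of f, so the double space in the input is left untouched.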