Ict-innovation/LPI/103.2
103.2 Process text streams using filters
Candidates should be able to apply filters to text streams.
Key Knowledge Areas
- Send text files and output streams through text utility filters to modify the output using standard UNIX commands found in the GNU textutils package.
Text Processing Utilities
Linux has a rich assortment of utilities and tools for processing and manipulating text files. In this section we cover some of them.
cat - cat is short for concatenate and is a Linux command used to write the contents of a file to standard output. cat is usually used in combination with other commands to manipulate a file, or on its own to quickly get an idea of the contents of a file. The simplest form of the command is:
# cat /etc/aliases
cat can take several options, the most commonly used being -n and -b, which output line numbers on all lines and on non-empty lines only, respectively.
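The difference between -n and -b shows up on a file that contains a blank line. A quick sketch (the demo file name and contents are arbitrary):

```shell
# Create a small demo file containing a blank line.
printf 'first\n\nsecond\n' > /tmp/demo.txt

# -n numbers every line, including the blank one.
cat -n /tmp/demo.txt

# -b numbers non-empty lines only; the blank line is printed unnumbered.
cat -b /tmp/demo.txt
```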
head and tail - The utilities head and tail are often used to examine log files. By default they output 10 lines of text. Here are the main usages.
List 20 first lines of /var/log/messages:
# head -n 20 /var/log/messages
# head -20 /var/log/messages
List 20 last lines of /etc/aliases:
# tail -20 /etc/aliases
The tail utility has an added option that allows one to list the end of a text starting at a given line.
List text starting at line 25 in /var/log/messages:
# tail -n +25 /var/log/messages
Finally tail can continuously read a file using the -f option. This is most useful when you are examining live log files for example.
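head and tail can also be combined in a pipeline to extract a range of lines from the middle of a file. A sketch (the demo file name is arbitrary):

```shell
# Demo file: the numbers 1 through 20, one per line.
seq 20 > /tmp/numbers.txt

# Print lines 5 to 10: head keeps the first 10 lines,
# tail then keeps the last 6 of those.
head -n 10 /tmp/numbers.txt | tail -n 6
```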
wc - The wc utility counts the number of bytes, words, and lines in files. Several options allow you to control wc's output.
- -l : count the number of lines
- -w : count the number of words
- -c or -m : count the number of bytes or characters
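The three counts can be checked against a small sample file (the file name and contents are arbitrary):

```shell
# A three-line sample: 3 lines, 6 words, 28 bytes in total.
printf 'one two\nthree\nfour five six\n' > /tmp/wc-demo.txt

wc -l /tmp/wc-demo.txt    # line count
wc -w /tmp/wc-demo.txt    # word count
wc -c /tmp/wc-demo.txt    # byte count
```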
nl - The nl utility numbers lines and by default has the same output as cat -b.
Number all lines including blanks:
# nl -ba /etc/lilo.conf
Number only lines with text:
# nl -bt /etc/lilo.conf
expand/unexpand - The expand command is used to replace TABs with spaces. One can use unexpand for the reverse operation.
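With the default tab stops every 8 columns, a TAB after a single character becomes 7 spaces, and unexpand -a converts runs of spaces back into TABs. A sketch:

```shell
# expand turns the TAB into spaces (default tab stops every 8 columns).
printf 'a\tb\n' | expand

# unexpand -a converts the run of spaces back into a TAB.
printf 'a\tb\n' | expand | unexpand -a
```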
od - To inspect the raw contents of a file (for example a binary file) there are a number of tools available. The most common ones are od (octal dump) and hexdump.
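od reads standard input when no file is given, so a short string can be dumped directly. The -c format shows printable characters and backslash escapes; -An suppresses the leading offset column:

```shell
# Dump the bytes of a short string as characters and escapes.
printf 'Hi\n' | od -An -c
```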
split - splitting files - The split tool can split a file into smaller files using criteria such as size or number of lines. For example, we can split /etc/passwd into smaller files containing 5 lines each:
# split -l 5 /etc/passwd
This will create files called xaa, xab, xac, xad ... each containing at most 5 lines. It is possible to give a more meaningful prefix for the files (other than x), such as passwd-5, on the command line:
# split -l 5 /etc/passwd passwd-5
This creates files identical to the ones above (xaa, xab, xac, xad ...) but the names are now passwd-5aa, passwd-5ab, passwd-5ac, passwd-5ad ...
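Since the pieces sort in order by name, concatenating them reproduces the original file, which is an easy way to verify a split. A sketch (the scratch directory is arbitrary, and assumes /etc/passwd exists as on a normal Linux system):

```shell
# Split /etc/passwd into 5-line pieces inside a scratch directory.
mkdir -p /tmp/split-demo
split -l 5 /etc/passwd /tmp/split-demo/passwd-5

# The glob lists the pieces in name order, so cat rebuilds the original.
cat /tmp/split-demo/passwd-5* > /tmp/split-demo/rejoined
cmp -s /tmp/split-demo/rejoined /etc/passwd && echo "pieces re-join to the original"
```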
Erasing consecutive duplicate lines
The uniq tool will send to stdout only one copy of consecutive identical lines.
Consider the following example:
# uniq > /tmp/UNIQUE
line 1
line 2
line 2
line 3
line 3
line 3
line 1
^D
The file /tmp/UNIQUE has the following content:
# cat /tmp/UNIQUE
line 1
line 2
line 3
line 1
NOTE:
From the example above we see that when using uniq, non-consecutive identical lines are still printed to STDOUT. Usually the output is sorted first so that identical lines all appear together:
# sort | uniq > /tmp/UNIQUE
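The effect of sorting first can be seen on a small file (the file name is arbitrary); the -c option additionally prefixes each line with its count:

```shell
# "line 1" appears twice, but not consecutively.
printf 'line 1\nline 2\nline 2\nline 3\nline 1\n' > /tmp/uniq-demo

uniq /tmp/uniq-demo              # "line 1" still printed twice
sort /tmp/uniq-demo | uniq       # each line printed once
sort /tmp/uniq-demo | uniq -c    # each line prefixed with its count
```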
cut - The cut utility can extract a range of characters or fields from each line of a text. The -c option is used to cut based on character positions.
Syntax:
cut -c {range1,range2}
Example:
# cut -c5-10,15- /etc/passwd
The example above outputs characters 5 to 10 and 15 to end of line for each line in /etc/passwd. One can also specify the field delimiter (a space, a comma, etc.) of a file as well as the fields to output. These options are set with the -d and -f flags respectively.
Syntax:
cut -d {delimiter} -f {fields}
Example:
# cut -d: -f 1,7 --output-delimiter=" " /etc/passwd
This outputs fields 1 and 7 of /etc/passwd delimited with a space. The default output-delimiter is the same as the original input delimiter. The --output-delimiter option allows you to change this.
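Both cutting modes can be tried on a single line from standard input (the sample line mimics an /etc/passwd entry; note that --output-delimiter is a GNU extension):

```shell
# Character mode: extract characters 2 to 4.
echo 'abcdefghij' | cut -c2-4

# Field mode: fields 1 (login) and 7 (shell), ':' as input delimiter,
# a space as the output delimiter.
echo 'root:x:0:0:root:/root:/bin/bash' | cut -d: -f1,7 --output-delimiter=' '
```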
paste/join - The simpler utility is paste, which concatenates two files line by line, next to each other.
Syntax:
paste text1 text2
With join you can further specify which fields you are considering.
Syntax:
join -j1 {field_num} -j2 {field_num} text1 text2
or
join -1 {field_num} -2 {field_num} text1 text2
Text is sent to stdout only if the specified fields match. Both files should first be sorted on the join field: comparison is done one line at a time, and as soon as no match is made the process stops, even if more matches exist further down the file.
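The difference between the two utilities shows on a pair of small files sharing a key in the first field (the file names and contents are arbitrary; both files are already sorted on the key):

```shell
printf '1 apple\n2 banana\n3 cherry\n' > /tmp/a.txt
printf '1 red\n2 yellow\n' > /tmp/b.txt

# paste glues the lines side by side, separated by a TAB.
paste /tmp/a.txt /tmp/b.txt

# join prints one line per key found in both files:
# the key, then the remaining fields of each file.
join -1 1 -2 1 /tmp/a.txt /tmp/b.txt
```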
sort - By default, sort will arrange a text in alphabetical order. To perform a numerical sort use the -n option.
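The distinction matters for numbers: alphabetically, "10" sorts before "9" because the comparison is character by character. A sketch:

```shell
# Alphabetical sort compares character by character: 1, 10, 9.
printf '10\n9\n1\n' | sort

# Numerical sort compares values: 1, 9, 10.
printf '10\n9\n1\n' | sort -n
```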
Formatting output with fmt and pr
fmt is a simple text formatter that reformats text into lines of a specified length.
You can modify the number of characters per line of output using fmt. By default fmt will concatenate lines and output 75 character lines.
fmt options:
- -w : number of characters per line
- -s : split long lines but do not refill
- -u : place one space between each word and two spaces at the end of a sentence
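A quick way to see -w in action is to reflow a single long line (the sample sentence is arbitrary; fmt may choose slightly different break points, but never exceeds the given width):

```shell
# Reflow one long line into lines of at most 20 characters.
echo 'the quick brown fox jumps over the lazy dog' | fmt -w 20
```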
Long files can be paginated to fit a given size of paper with the pr utility. Text is broken into pages of a specified length and page headers are added. One can control the page length (default is 66 lines) and page width (default 72 characters) as well as the number of columns.
pr can also produce multi-column output.
When outputting text in multiple columns, each column is truncated evenly to fit the defined page width. This means that characters are dropped unless the original text is edited to avoid this.
tr - The tr utility translates one set of characters into another.
Example: changing uppercase letters into lowercase:

tr 'A-Z' 'a-z' < file.txt
Replacing delimiters in /etc/passwd:
# tr ':' ' ' < /etc/passwd
NOTE: tr takes only two arguments, the two character sets; it does not take a file name argument and reads standard input only.
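Besides translating, tr can delete characters (-d) or squeeze runs of repeated characters into one (-s). A sketch:

```shell
# Translate uppercase to lowercase.
echo 'Hello, World!' | tr 'A-Z' 'a-z'

# -d deletes every character in the set (here: the digits).
echo 'abc123' | tr -d '0-9'

# -s squeezes runs of repeated characters into a single one.
echo 'too    many spaces' | tr -s ' '
```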
sed - sed stands for stream editor and is used to manipulate text streams. It is most commonly used to transform text input generated by other commands in bash scripts. sed is a complex tool that can take some time to master. Its most common use case is to find and replace text in an input stream. sed's output is written to standard output, with the original file left untouched, and needs to be redirected to a file to make the changes permanent.
The command:
# sed 's/linux/Linux/g' readme.txt > ReadMe.txt
will replace every occurrence of the word linux with Linux in the readme.txt file. The g at the end of the command makes the replacement global, so sed processes the entire line rather than stopping at the first occurrence of the word linux. For more information on sed refer to section 103.7.
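The effect of the g flag can be seen on a line containing the word twice; sed reads standard input when no file is given (the sample text is arbitrary):

```shell
# Without g: only the first match on the line is replaced.
echo 'linux is fun, linux is free' | sed 's/linux/Linux/'

# With g: every match on the line is replaced.
echo 'linux is fun, linux is free' | sed 's/linux/Linux/g'
```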
Used files, terms and utilities:
- cat
- cut
- expand
- fmt
- head
- od
- join
- nl
- paste
- pr
- sed
- sort
- split
- tail
- tr
- unexpand
- uniq
- wc