Awk-An Advanced Filter
The awk command made a late entry into the UNIX system in 1977 to augment the tool kit with
suitable report formatting capabilities. Named after its authors, Aho, Weinberger and Kernighan,
awk, until the advent of perl, was the most powerful utility for text manipulation. Like sed, it
combines features of several filters, though its report writing capability is the most useful. awk
appears as gawk (GNU awk) in Linux.

awk doesn't belong to the do-one-thing-well family of UNIX commands. In fact, it can do several
things, and some of them quite well. Unlike other filters, it operates at the field level and can
easily access, transform and format individual fields in a line. It also accepts extended regular
expressions (EREs) for pattern matching, has C-type programming constructs, variables and several
built-in functions. We'll discuss the important awk features in some detail because that will help
you in understanding perl, which uses most of the awk constructs, sometimes in an identical
manner.
WHAT YOU WILL LEARN
• Understand awk's unusual syntax with its selection criteria and action components.
• Split a line into fields and format the output with printf.
• Use the comparison operators to select lines on practically any condition.
• Use the ~ and !~ operators with extended regular expressions (EREs) for pattern matching.
• Handle decimal numbers and use them for computation.
• Use variables without declaring or initializing them.
• Do some pre- and post-processing with the BEGIN and END sections.
• Examine awk's built-in variables.
• Use arrays and access an array element with a nonnumeric subscript.
• Use the built-in functions for performing string handling tasks.
• Make decisions with the if statement and its compact one-line conditional.
• Use the for and while loops to perform tasks repeatedly.
386 UNIX:Concepts and Applications
The selection criteria here translates to this: "Select those lines where the third field doesn't (!=)
completely match the string director and (&&) also doesn't (!=) completely match the string chairman."
Note that the match is made for the entire field, rather than a string embedded in the field space. We
couldn't verify that because the file empn.lst has all trailing spaces trimmed. But will a similar awk
program work on the file emp.lst which has trailing spaces attached to the strings?

    $ awk -F"|" '$3 == "director" || $3 == "chairman"' emp.lst

So, it won't; the third field contains trailing whitespace, so a perfect match wasn't found. Field
matching is better done with regular expressions as you'll see in the next topic.

    $ awk -F"|" '$3 == "director" {
    > printf "%-20s %-12s %d %d %d\n", $2, $3, $6, $6*0.4, $6*0.15 }' empn.lst
    lalit chowdury       director     8200 3280 1230
    barun sengupta       director     7800 3120 1170
    jai sharma           director     7000 2800 1050
    chanchal singhvi     director     6700 2680 1005
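
The earlier point about exact comparison failing on a space-padded field can be checked with a small
sketch. The two records below are made-up stand-ins for a padded line (as in emp.lst) and a trimmed
line (as in empn.lst), not the actual files, and the shell variables exact and regex are illustrative
names only.

```shell
# Hypothetical sample data: the first record has trailing spaces in field 3,
# the second one is trimmed.
data='9876|jai sharma|director  |production|7000
2365|barun sengupta|director|personnel|7800'

# Exact comparison: the padded field is not equal to "director".
exact=$(printf '%s\n' "$data" | awk -F"|" '$3 == "director" { print $2 }')

# Regular expression match: finds "director" anywhere inside the field.
regex=$(printf '%s\n' "$data" | awk -F"|" '$3 ~ /director/ { print $2 }')

printf 'exact match selects:\n%s\n' "$exact"
printf 'regex match selects:\n%s\n' "$regex"
```

Only the trimmed record survives the == test, while the regular expression selects both.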
21.6 VARIABLES
While awk has certain built-in variables, like NR and $0, representing the line number and the entire
line, respectively, it also permits the user to use variables of her choice. A user-defined variable used
by awk has two special features:
• No type declarations are needed.
• By default, variables are initialized to zero or a null string, depending on their type. awk has a
  mechanism of identifying the type of variable used from its context.
You can now use these properties to print a serial number, using the variable kount, and apply it to
select those directors drawing a salary exceeding 6700:
    $ awk -F"|" '$3 == "director" && $6 > 6700 {
    > kount = kount + 1
    > printf "%3d %-20s %-12s %d\n", kount, $2, $3, $6 }' empn.lst
      1 lalit chowdury       director     8200
      2 barun sengupta       director     7800
      3 jai sharma           director     7000
The initial value of kount was 0 (by default). That's why the first line is correctly assigned the
number 1. awk also accepts the C-style incrementing forms:

    kount++                    Same as kount = kount + 1
    kount += 2                 Same as kount = kount + 2
    printf "%3d\n", ++kount    Increments kount before printing
Tip: No type declarations or initial values are required for user-defined variables used in an
awk program. awk identifies their type and initializes them to zero or null strings.
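
A minimal sketch of this default initialization, assuming three arbitrary input lines. The variable
label is deliberately never assigned, to show the null-string default alongside the numeric default
of kount; both names are illustrative.

```shell
# kount is used without declaration: it starts at 0 in a numeric context.
# label is never set: it is an empty string in a string context.
out=$(printf 'a\nb\nc\n' | awk '
{ kount++; print kount, $0 }
END { print "total:" kount, "label:[" label "]" }')

printf '%s\n' "$out"
```

The last line of output reads total:3 label:[], confirming both defaults.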
    BEGIN {
        printf "\t\tEmployee abstract\n\n"
    } $6 > 7500 {            # Increment the variables for the serial number and the pay
        kount++ ; tot += $6  # Multiple assignments in one line
        printf "%3d %-20s %-12s %d\n", kount, $2, $3, $6
    }
    END {
        printf "\n\tThe average basic pay is %6d\n", tot/kount
    }
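
The same BEGIN, body and END layout can be tried out with a short runnable sketch. The three records
are invented stand-ins for empn.lst, and the check for a zero count in END is an addition that the
program above does not have; it merely avoids a division by zero when no line is selected.

```shell
# Hypothetical sample records (empid|name|designation|department|dob|basic).
data='6521|lalit chowdury|director|marketing|26/09/45|8200
2365|barun sengupta|director|personnel|11/05/47|7800
9876|jai sharma|director|production|12/03/50|7000'

report=$(printf '%s\n' "$data" | awk -F"|" '
BEGIN { print "Employee abstract" }        # runs once, before any input is read
$6 > 7500 {                                # runs for every line with basic pay over 7500
    kount++ ; tot += $6
    printf "%3d %-20s %d\n", kount, $2, $6
}
END {                                      # runs once, after the last line
    if (kount > 0)
        printf "Average basic pay: %d\n", tot/kount
}')

printf '%s\n' "$report"
```

Two of the three sample records exceed 7500, so the average reported is (8200 + 7800)/2.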
21.9 BUILT-IN VARIABLES
awk has several built-in variables (Table 21.2). They are all assigned automatically, though it is also
possible for a user to reassign some of them. You have already used NR, which signifies the record
number of the current line. We'll now have a brief look at some of the other variables.
The FS Variable As stated elsewhere, awk uses a contiguous string of spaces as the default field
delimiter. FS redefines this field separator, which in the sample database happens to be the |. When
used at all, it must occur in the BEGIN section so that the body of the program knows its value before
it starts processing:

    BEGIN { FS = "|" }

This is an alternative to the -F option of the command which does the same thing.
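
The equivalence of the two forms can be checked with a brief sketch. The record is a made-up line in
the style of the sample database, and with_F and with_FS are illustrative shell variable names.

```shell
# One hypothetical record with | as the field delimiter.
line='2233|a.k. shukla|g.m.|sales|12/12/52|6000'

# Delimiter given on the command line.
with_F=$(printf '%s\n' "$line" | awk -F"|" '{ print $2 }')

# Delimiter set in the BEGIN section, before the first line is read.
with_FS=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "|" } { print $2 }')

printf '%s\n%s\n' "$with_F" "$with_FS"
```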
The OFS Variable When you used the print statement with comma-separated arguments, each
argument was separated from the other by a space. This is awk's default output field separator, and
can be reassigned using the variable OFS in the BEGIN section:

    BEGIN { OFS = "~" }

When you reassign this variable with a ~ (tilde), awk will use this character for delimiting the print
arguments. This is a useful variable for creating lines with delimited fields.
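
A short sketch of OFS at work, assuming a made-up four-field record. Note that OFS only takes effect
between comma-separated print arguments, which is why the fields are listed with commas.

```shell
# Read |-delimited input, write ~-delimited output.
out=$(printf '%s\n' '2233|a.k. shukla|g.m.|sales' |
      awk 'BEGIN { FS = "|" ; OFS = "~" } { print $2, $3, $4 }')

printf '%s\n' "$out"
```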
The NF Variable NF comes in quite handy in cleaning up a database of lines that don't contain the
right number of fields. By using it on a file, say empx.lst, you can locate those lines not having six
fields, and which have crept in due to faulty data entry:

    $ awk 'BEGIN { FS = "|" }
    > NF != 6 {
    > print "Record No ", NR, "has ", NF, " fields"}' empx.lst
    Record No  6 has  4  fields
    Record No  17 has  5  fields
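
The same check can be reproduced on a small invented data set in which only the second record is
short of fields. The data and the shell variable out are assumptions made for this sketch; empx.lst
itself is not used.

```shell
# Three hypothetical records; the second has four fields instead of six.
data='1|a|b|c|d|e
2|a|b|c
3|a|b|c|d|e'

out=$(printf '%s\n' "$data" | awk 'BEGIN { FS = "|" }
NF != 6 { print "Record No", NR, "has", NF, "fields" }')

printf '%s\n' "$out"
```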
The FILENAME Variable FILENAME stores the name of the current file being processed. Like
grep and sed, awk can also handle multiple filenames in the command line. By default, awk doesn't
print the filename, but you can instruct it to do so:

    '$6 < 4000 { print FILENAME, $0 }'

With FILENAME, you can devise logic that does different things depending on the file being
processed.
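
A minimal sketch with two files, assuming invented file names (a.lst, b.lst) and invented
four-field records created in a temporary directory. It only illustrates that FILENAME changes as awk
moves from one file to the next.

```shell
# Create two small hypothetical files in a scratch directory.
dir=$(mktemp -d)
printf '1|amit|clerk|3500\n' > "$dir/a.lst"
printf '2|rina|manager|9000\n3|dev|peon|2000\n' > "$dir/b.lst"

# Prefix each selected line with the name of the file it came from.
out=$(cd "$dir" && awk -F"|" '$4 < 4000 { print FILENAME ": " $2 }' a.lst b.lst)

printf '%s\n' "$out"
rm -r "$dir"
```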
Fig. 21.2 empawk3.awk
Note that this time we didn't match the patterns sales and marketing specifically in a field. We
could afford to do that because the patterns occur only in the fourth field, and there's no scope here
for ambiguity. When you run the program, it outputs the averages of the two elements of pay:
    $ awk -f empawk3.awk empn.lst
                 Basic     Da    Hra  Gross
    Average       6812   1703   3406  11921
The program is too simple to require any explanation. C programmers will find the syntax quite
comfortable to work with, except that awk simplifies a number of things that require explicit
specification in C. There are no type declarations, no initializations and no statement terminators.
21.11 FUNCTIONS
awk has several built-in functions, performing both arithmetic and string operations (Table 21.3).
The arguments are passed to a function in C-style, delimited by commas and enclosed by a matched
pair of parentheses. In contrast to C, however, when a function is used without arguments, the ()
symbols need not be used.

Some of these functions take a variable number of arguments, and one (length) uses no argument
as a variant form. The functions are adequately explained here so you can confidently use them in
perl which often uses identical syntaxes.

There are two arithmetic functions which a programmer will expect awk to offer. int calculates the
integral portion of a number (without rounding off), while sqrt calculates the square root of a
number. awk also has some of the common string handling functions you can hope to find in any
language. We'll review them briefly.
length length determines the length of its argument, and if no argument is present, then it
assumes the entire line as its argument. You can use length (without any argument) to locate lines
whose length exceeds 1024 characters:

    awk -F"|" 'length > 1024' empn.lst

You can use length with a field as well. The following program selects those people who have short
names:
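
One way such a selection might be written is sketched here. The cutoff of 11 characters and the two
sample records are assumptions made for illustration, since the text does not state what counts as a
short name.

```shell
# Two hypothetical records; the name is the second field.
data='9876|jai sharma|director
1006|chanchal singhvi|director'

# Select names shorter than 11 characters (an assumed threshold).
short=$(printf '%s\n' "$data" | awk -F"|" 'length($2) < 11 { print $2 }')

printf '%s\n' "$short"
```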
You should be familiar with all the functions discussed in this section as they are used in a wide
variety of situations. We'll use them again in perl. awk features some more built-in variables and
functions, and also allows the user to define her own functions.
Table 21.3 Built-in Functions in awk

Function             Description
int(x)               Returns integer value of x
sqrt(x)              Returns square root of x
length               Returns length of complete line
length(x)            Returns length of x
substr(stg,m,n)      Returns portion of string of length n, starting from position m in string stg
index(s1,s2)         Returns position of string s2 in string s1
split(stg,arr,ch)    Splits string stg into array arr using ch as delimiter
system("cmd")        Runs UNIX command cmd and returns its exit status
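
The functions in Table 21.3 can be exercised from a BEGIN section alone, without any input file. This
is only a sketch: the argument values are arbitrary, the array name parts is illustrative, and the
UNIX command true is used with system simply because it is known to exit with status 0.

```shell
out=$(awk 'BEGIN {
    print int(7.9)                      # 7  (truncated, not rounded)
    print sqrt(144)                     # 12
    print length("director")            # 8
    print substr("director", 1, 3)      # dir
    print index("director", "rec")      # 3
    n = split("12/03/50", parts, "/")   # n holds the number of pieces
    print n, parts[3]                   # 3 50
    print system("true")                # 0  (exit status of the command)
}')

printf '%s\n' "$out"
```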
    statements          else is optional

    $6 > 7500 {

An alternative form of this logic places the condition inside the action component rather than the
selection criteria. But this form requires the if statement:

    awk -F"|" '{ if ($6 > 7500) printf
if can be used with the comparison operators and the special symbols ~ and !~ to match a regular
expression. When used in combination with the logical operators || and &&, awk programming
becomes quite easy and powerful. Some of the earlier pattern matching expressions are rephrased
in the following, this time in the form used by if:

    if ( NR >= 3 && NR <= 6 )
    if ( $3 == "director" || $3 == "chairman" )
    if ( $3 ~ /^g.m/ )
    if ( $2 !~ /[aA]gg?[ar]+wal/ )
    if ( $2 ~ /[cC]ho[wu]dh?ury|sa[xk]s?ena/ )
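
A complete, runnable sketch of a condition placed inside the action follows. It combines a regular
expression match with a numeric comparison using &&. The three four-field records are invented, with
the basic pay in the fourth field rather than the sixth.

```shell
# Hypothetical records: id|name|designation|basic pay.
data='1|v.k. agarwal|g.m.|9000
2|sumit chakrobarty|d.g.m.|6000
3|anil aggarwal|manager|5000'

# No selection criteria: every line reaches the action, and if does the filtering.
out=$(printf '%s\n' "$data" | awk -F"|" '{
    if ($2 ~ /[aA]gg?[ar]+wal/ && $4 > 7500)
        print $2
}')

printf '%s\n' "$out"
```

Both agarwal and aggarwal match the expression, but only the first also satisfies the pay condition.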
To illustrate the use of the optional else statement, let's assume that the dearness allowance is 25%
of basic pay when the latter is less than 6000, and 1000 otherwise. The if-else structure that
implements this logic looks like this:

    if ( $6 < 6000 )
        da = 0.25*$6
    else
        da = 1000
You can even replace the above if construct with a compact conditional structure:

    $6 < 6000 ? da = 0.25*$6 : da = 1000

This is the form that C and perl use to implement the logic of a simple if-else construct. The ?
and : act as separators of the two actions.
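
A runnable sketch of the conditional follows. It uses a slightly different arrangement from the line
above: the whole conditional is parenthesized on the condition side and its value is assigned to da
once, which sidesteps any doubt about how an assignment inside the branches is parsed. The two input
values are arbitrary basic-pay figures supplied as the only field.

```shell
# 5000 is below the 6000 threshold, 7000 is not.
out=$(printf '5000\n7000\n' |
      awk '{ da = ($1 < 6000) ? 0.25*$1 : 1000 ; print $1, da }')

printf '%s\n' "$out"
```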
When you have more than one statement to be executed, they must be bounded by a pair of curly
braces (as in C). For example, if the factors determining the hra and da are in turn dependent on
the basic pay itself, then you need to use terminators:
    if ( $6 < 6000 ) {
        hra = 0.50*$6
        da = 0.25*$6
    } else {
        hra = 0.40*$6
        da = 1000
    }
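
The braced if-else can be run on two sample pay figures to see both branches taken. In this sketch
the pay is supplied as the only field ($1 instead of $6), and the two values are arbitrary.

```shell
out=$(printf '5000\n7000\n' | awk '{
    if ($1 < 6000) {
        hra = 0.50*$1       # both statements belong to the if branch
        da  = 0.25*$1
    } else {
        hra = 0.40*$1       # both statements belong to the else branch
        da  = 1000
    }
    print $1, hra, da
}')

printf '%s\n' "$out"
```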
Note The same SQL-style logic has already been implemented by using three commands in a
pipeline—cut, sort and uniq (14.8.1). That one used only a single line of code!
21.14 LOOPING WITH while
The while loop has a similar role to play; it repeatedly iterates the loop as long as the control
command succeeds. For example, the previous for loop used for centering text can be easily replaced
with a while construct:

    k = 0                                   # k first initialized
    while (k < (55 - length($0))/2) {
        printf "%s", " "
        k++
    }
    print $0

The loop here prints a space and increments the value of k with every iteration. The condition
(k < (55 - length($0))/2) is tested at the beginning of every iteration, and the loop body performed
only if the test succeeds. In this way, the line is padded on the left with a string of spaces before
the actual text is printed with print $0.

Note that the length function has been used with an argument ($0). This awk understands to be the
entire line. Since length, in the absence of an argument, uses the entire line anyway, $0 can be
omitted. Similarly, print $0 may also be replaced by simply print.
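
The loop can be tried on a single short line. In this sketch the input is the three-character string
awk, so the condition k < (55 - 3)/2 holds for k from 0 to 25 and 26 spaces are printed before the
text. The shell variable out is an illustrative name.

```shell
out=$(printf 'awk\n' | awk '{
    k = 0
    while (k < (55 - length($0))/2) {   # tested before every iteration
        printf "%s", " "                # one space per pass
        k++
    }
    print $0
}')

# Brackets make the leading spaces visible.
printf '[%s]\n' "$out"
```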
21.15 CONCLUSION
awk, like sed, violates the do-one-thing-well philosophy that generally characterizes all UNIX
tools. Although presented in this chapter as a utility filter, it's more of a scripting language. At the
time of its entry, you didn't have regular expressions in other languages. You couldn't intermingle
strings with numbers. Partly because of the absence of type declarations and initializations, an
awk program is often a fraction of the size of its C counterpart.

awk has been completely overwhelmed in sheer power by perl, the latest and most notable addition
to the UNIX tool kit for several years. There is nothing that any UNIX filter can do and which
perl can't. In fact, perl is even more compact, is faster, and in every sense better than them.
This chapter was prepared for you to understand perl better because so many of the constructs are also
used there. perl is taken up in Chapter 22.
WRAP UP
The awk filter combines features of several filters. awk can manipulate individual fields ($1, $2, etc.)
in a line ($0). It uses sed-type addresses and the built-in variable NR to determine line numbers.

Lines are printed with print and printf. The latter uses format specifiers to format strings (%s),
integers (%d) and floating point numbers (%f). Each print or printf statement can be used with the
shell's operators for redirection and piping.

awk uses all the comparison operators (like ==, etc.). The special operators ~ and !~ are used
to match regular expressions or negate a match with specific fields. The ^ and $ are used to anchor
a pattern at the beginning or end of a field rather than the line.

awk can perform numeric computation. It overcomes a shell limitation by handling decimal numbers.
Variables are used without initializing them or declaring their type. awk accepts x++ as a way of
incrementing variables.

awk can take instructions from an external file (-f). In this case, the program must not be enclosed
within quotes. The BEGIN and END sections are used to do some pre- and post-processing work. A
report header is generated by the BEGIN section, and a numeric total is computed in the END
section.

awk's built-in variables can be used to specify the field delimiter (FS), the number of fields (NF) and
the filename (FILENAME). awk uses one-dimensional arrays where the array subscript can be a
string as well.

awk has a number of built-in functions, and many of them are used for string handling. You can find
the length (length), extract a substring (substr) and find the location (index) of a string within a
larger string. The system function executes a UNIX command.

The if statement uses the return value of its control command to determine program flow. if also
uses the operators || and && to handle complex conditions. The first form of the for loop uses an
array and can be used to count occurrences of an item using a nonnumeric subscript. The other
form resembles its C counterpart. The while loop repeats a set of instructions as long as its control
command returns a true value.

perl is better than awk.
FLEX YOUR BRAIN
21.1 Display from /etc/passwd a list of users and their shells for those using the Korn shell or
     Bash. Order the output by the shell used.
21.2 Find out the next available UID in /etc/passwd after ignoring all system users placed at the
     beginning and up to the occurrence of the user nobody.
21.3 The tar command on one system can't accept absolute pathnames longer than 100 characters.
     How can you use find and awk to generate a list of such files?
21.4 Invert the name of the individual in emp.lst (14.1) so that the last name occurs first.
21.5 Use a script to kill a process by specifying its name rather than the PID.
21.6 How can you print the last field of a line even if all lines don't contain the same number of
     fields? Assume the field delimiter is the :.
21.7 List the users currently using the system along with a count of the number of times they have
     logged in.
21.8 Write an awk sequence in a shell script which accepts input from the standard input. The
     program should print the total of any column specified as script argument. For instance,
     prog1 | awk_prog 3 should print the total of the third column in the output of prog1.