0% found this document useful (0 votes)
113 views

Awk-An Advanced Filter

This document discusses the awk command, which is used for text manipulation and report formatting. Some key points: - Awk was introduced in 1977 to augment UNIX with report formatting capabilities. It is named after its authors and was the most powerful text manipulation tool prior to perl. - Awk can perform several tasks like filtering, transforming, and formatting fields in a line. It uses extended regular expressions for pattern matching and has programming constructs like variables, conditional statements, and loops. - Awk splits a line into fields separated by whitespace by default. It can select lines using patterns, conditions, line numbers, and more. Fields can be accessed and formatted using print statements.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
113 views

Awk-An Advanced Filter

This document discusses the awk command, which is used for text manipulation and report formatting. Some key points: - Awk was introduced in 1977 to augment UNIX with report formatting capabilities. It is named after its authors and was the most powerful text manipulation tool prior to perl. - Awk can perform several tasks like filtering, transforming, and formatting fields in a line. It uses extended regular expressions for pattern matching and has programming constructs like variables, conditional statements, and loops. - Awk splits a line into fields separated by whitespace by default. It can select lines using patterns, conditions, line numbers, and more. Fields can be accessed and formatted using print statements.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

awk—An Advanced Filter

Theawkcommand made a late entry into the UNIX system in 1977to augment the tool kit with
suitablereport formatting capabilities. Named after its authors, Aho, Wéinbcrgerand Kernighan,
until the advent of perl, was the most powerful utility for text manipulation. Like sed, it
combinesfeatures of several filters, though its report writing capability is the most useful. awk
appears as gawk(GNU awk) in Linux.
doesn'tbelong to the do-one-thing-well family of UNIX commands. In fact, it can do several
things—andsome of them quite well. Unlike other filters, it operates at thefield level and can
easilyaccess,transform and format individual fields in a line. It also accepts extended regular
expressions(EREs) for pattern matching, has C-type programming constructs, variables and several
built-infunctions. We'll discuss the important awkfeatures in some detail because that will help
manner.
youin understanding perl , which uses most of the awkconstructs, sometimes in identical
WHAT YOU WILLLEARN
components.
• Understand awk'sunusual syntax with its selectioncriteria and action
Splita line into fields and format the output with pri ntf.
any condition.
• Usethe comparison operators to select lines on practically
(EREs) for pattern matching.
Usethe and ! —operators with extended regular expressions
• Handle decimal numbers and use them for computation.
• Usevariableswithout declaring or initializing them.
and END sections.
• Do somepre- and post-processing with the BEGIN
• Examineawk'sbuilt-in variables.
nonnumeric subscript.
• Usearrays and access an array element with a
handling tasks.
• Usethe built-in functions for performing string
its compact one-line conditional.
Makedecisions with the i f statement and
repeatedly
• Usethe for and while loops to perform tasks
386 UNIX:Concepts and Applications

21.1 SIMPLE awkFILTERING


awkis not just a comtnand, but a programming language too. It uscs an unusual syntaxthatuses
two components and rcquircs single quotes and curly braces:
awk options 'selection criteria (action) file(s)
The constituents resemble those offi nd. The selection criteria (a form of addressing) filtersinput
and selects lines for the action component to act upon. This component is enclosed withincurly
braces. The selection critena and action constitute an awkprogram that is surrounded by a
single quotes. These programs are often one line long though they can span several lines aswell.
The selection criteria in awkhave wider scope than in sed. Like there, they can be Patternslike
/negroponte/ or line addresses using awk'sbuilt-in variable, NR.Further, they can also be conditional
expressions using the &&and I I operators as used in the shell. Youcan select lines practicallyonany
condition.
A typically complete awkcommand specifies the selection criteria and action. The following
command selects the directors from emp.1st:
$ awk '/director/ { print } ' emp.lst
98761jai sharma Idirector I production 112/03/5017000
23651barun sengupta Idi rector Ipersonnel 111/05/4717800
10061chanchal singhvi Idirector I sales | 03/09/38 | 6700
65211lalit chowdury Idi rector Imarketing 126/09/4518200
The selection criteria section (/di rector/ ) selects lines that are processed in the actionsectiom
( { print J). If selection criteria is missing, then action applies to all lines of the file. If actionis
missing, the entire line is printed. Either of the two (but not both) is optional, but they mustbe
enclosed within a pair of single (not double) quotes.
The print statement, when used without any field specifiers, prints the entire line. Moreover,since
printing is the default action ofawk, all the following three forms could be considered equivalent
awk '/director/ i emp.lst Printing is the default action
awk I /director/{ print } j emp.lst permitted
Whitespace
awk '/director/ { print $0}' emp.lst $0 is the complete line
For pattern matching, awkuses regular expressions in sed-style:
$ awk -F " I " emp.lst
32121shyam saksena Id.g.m. I accounts | 12/12/55 | 6000
23451j.b. saxena lg.m. Imarketing 112/03/4518000
The regular expressions used by awkbelong to the basic BRE (but not the IRE and T RE) and ERE
variety that's used by grep -E (15.3) or egrep. This means that you can also use multiplepatteins
by delimiting each pattern with a l.
Note An awkprogram must have either the selection criteria or the action, or both, butwithin
single quotes. Double quotes will create problems unless used judiciously.
awk—-AnAdvanced Filter 87

SPLITTING A LINE INTO FIELDS


212 the special parameter, $0, to
indicate the entire line, It also identifies fields by $1, $2, $3.
uses
theseparameters also have a special meaning to the shell, singlc-quoting an awkprogram
Since them from interpretation by the shell.
rotects
Unlikethe other UNIX filters which operate on fields, awkuses a contiguous sequence of spaces
d tabsas a single delimiter. But the sample database (14.1) uses the l, so wc must use the -F
optionto specifyit in our programs. You can use awkto print the name, designation, department
all the sales people:
andsalaryof
F"l" '/sales/ { print $2,$3,$4,$6} ' emp.lst
g.m. sales
a.k. shukla director
6000
singhvi
chanchal sales 6700
manager sales
s.n. dasgupta 5600
manager sales
anil aggarwal 5000
Noticethat a , (comma) has been used to delimit the field specifications.This ensures that each
fieldis separatedfrom the other by a space. If you don't put the comma, the fields will be glued
together.
sofar,the programs have produced readable output, but that is because the file emp.1st contains
fixed-length lines. Henceforth, the input for most awkprograms used in this chapter will come
fromthe file empn.1st which we created with sed in Section 15.10.This file is similar to emp.1st
exceptthat the lines are of variable length. A few lines of the file show the total absence of spaces
beforethe l:
$ head-2 empn.lst
32121shyamsaksenald.g.m. laccountsl 12/12/551600016213
62131karuna gangulylg.m. I accounts 105/06/621630016213
Withthis file as input, we'll use awkwith a line address (single or double) to select lines. If you want
toselectlines 3 to 6, all you have to do is use the built-in variable NRto specifythe line numbers:
$ awk = 6 { print NR, $2,$3,$6 } ' empn.lst
3 n.k. gupta chairman 5400
4 v.k. agrawal gem. 9000
5 j.b. saxena g.m. 8000
6 sumit chakrobarty d.g.m. 6000
Thisis awk'sway of implementing the sed instruction 3,6p. The statement NR - 3 is really a
should appear obvious to C
that is being tested, rather than an assignment; this
COndition
programs, and == is one of the
used in awk
Programmers.NRis just one of those built-in variables
manyoperatorsemployed in comparison tests.
whitespace as the default delimiter instead of a single space
awkis the only filter that uses
or tab.
388 UNIX:Concepts and Applications

21.3 printf: FORMATTINGOUTPUT


statement, you can use awk
The above output is unformatted, but with the C-likc printf
ntf function used in C, butinthi
ofthc Cortnatsused by thc pri
stream formatter. awkaccepts 'nost
numcric, You can now producea list
chapter, the format will bc used fV»rstring data, and %dfor
of all the agarwals:
$ awk -F nl" [ar]+wal/ {
> printf "%3d %-20s %-12s } ' empn.lst
4 v. k. agrawal 9000
9 sudhip Agapwal executive 7500
15 anil aggarwal manager 5000
Like in sed, an awkcommand is considered complete only when the quote is closed.The nameand
designation have been printed in spaces 20 and 12 characters wide, respectively; the - symbolleft.
justifies the output. The line number is three characters wide, right-justified. Note that printf
requires an \n to print a newline after each line. Using the various formats in an awkprogram,you
can have complete control over the way the output is presented.

21.3.1 Redirecting Standard Output


Every print and pri ntf statement can be separately redirected with the > and I symbols.However,
make sure the filename or the command that follows these symbols is enclosedwithindouble
quotes. For example, the followingstatement sorts the output of the pri ntfStatement:
printf "%s %-10s %-12s %-8s\n", "sort"
If you use redirection instead, the filename should be enclosed in quotes in a similar manner:
pri ntf %-10s %-12s , $1, $3, $4, $6 > "mslist"
awkthus provides the flexibilityof separately manipulating the different output streams.Butdon't
forget the quotes!

21.4 THE COMPARISONOPERATORS


How do you print the three fields for the directorsand the chairman? Since the designationfieldis
$3, you have to use it in the selection criteria:
$ awk F" l" '$3 "director" Il $3 "chairman"{
> printf "%-20s %-12s %d\n" $2,$3,$6 } ' empn.lst
n. k. gupta chai rman 5400
lalit chowdury di rector 8200
barun sengupta di rector 7800
jai sharma di rector 7000
chanchal singhvi di rector 6700
This is the first time we matched a pattern with a field. In fact, field matching is possibleonlyin
awkand perl. Here, awkuses the I I and &&logical operators in the same sense as used by C andthe
is
UNIX shell. This command looks for two strings only in the third field ($3).The secondmatch
attempted only if (l l) the first match fails.
awk—-AnAdvanced Filter 389

above condition, you should use the ! - and


Fornegatingthe opcrators instead:
"director" $3 ! "chai rman'i

Theselectioncriteria here translates to this: "Select those lincs where the third field doesn't (! z)
completelymatch the string di rector and (&&)also doesn't ( ! completely match the string chai rman.
that the match is made for the entire field, rather than a string embedded in the field space. Wc
couldn'tverifythat because the file empn.1st has all trailing spaces trimmed. But will a similar awk
work on the file emp.1st which has trailing spaces attached to the strings?
program
$3 "director" Il $3 "chai rman"' emp.lst

So,it won't; the third field contains trailing whitespace, so a perfect match wasn't found. Field
matchingis better done with regular expressions as you'll see in the next topic.

21.4.I - and The Regular Expression Operators


Howdoesone match regular expressions? Previouslywe had used awkwith regular expressions in
thismanner:
" " '/sa[kx] s*ena/' emp.lst
Thismatchesthe pattern saxena and saksena anywhere in the line and not in a specific field. For
matchinga regular expression with a field, awkoffers the and ! operators to match and negate a
match,respectively.With these operators, matching becomesmore specificas seen in the following
examples:
$2 /[cC]ho[wu]dh?ury/ I | $2 /sa [xk] s?ena/ Matchessecondfield
$2 s?ena/ Same as above
/di rector I chai rman/ Neither director nor chairman
Rememberthat the operators and ! —work only with field specifiers ($1, $2 etc.). The delimiting
ofpatternswith the I is an ERE feature, and awkuses extended regular expressions. However, awk
doesn'tacceptthe IRE and TRE used by grep and sed.
Tip To match a string embedded in a field, you must use instead of ==. Similarly, to negate
a match, use ! - instead of !=.
Tolocateonly theg.m.s, you just can't use this:
awk $3 /g.m./ { printf
Becauseg.m. is embedded in d.g.m., locating just the g.m.s requires use of the 'Nand $. These
have slightly different meanings when used by awk.Rather than match lines, they are
scharacters
usedto indicate the beginning and end of afield, respectively,unless you use them with $0. So, if
youuse the condition

/Ag.m./ { print f "


You'll
locateonly theg.m.s and discard the d.g.m.s.
UNIX:Concepts and Applications
search pattern by a Si
To match a string at the beginnina of the field, ptecede the
use a $ formatching pattem at the end of a field.

21.4.2 Number Comparison


awkcan also handlc numbers— both integer and floating type—and make relational onthem
Using operators from the sct shown in Tablc 21.l, you can now print the pay-slips for thosepeople
whose basic pay cxcccds 7500:
$ awk F I '$6 > 7500 {
printf "*-20s '6-12s ) ' empn.lst
v. k. agrawal gem. 9000
ä.b. saxena g.m. 8000
la) it chowdury di rector 8200
bat-un sengupta di rector 7800
Youcan also combine regular expression matching with numeric comparison to locate those,either
born in 1955, or drawing a basic pay greater than 8000:
$ awk - F•tI • ' $6 > 8000 Il $5 -/45$/' empn.lst
01101v.k. agrawal lg.m. Imarketing131/12/401900016521
23451j.b. saxenalg.m. Imarketingl 12/03/451800016521
65211lalit chowduryldi rectorlmarketing126/09/451820016521
Recall that the context address /45$/ matches the string 45 at the end ($) of the field.Withthese
operators, you can now select lines by locating a pattern as a simple string or a regular expressiom
as well as by a relational test on a number.

Table 21.1 The Comparison and Regular ExpressionMatching Operators


Oprator Significance
Less than
Less than or equal to
Equal to
Not equal to
Greater than or equal to
IGreater than
Matches a regular expression
Doesn't match a regular expression

21.5 NUMBER PROCESSING


awkcan perform computations on numbers using the arithmetic operators +, * / and %(modulus).
,
It also overcomesone of the most major limitations of the shell—the
inability to handle decimal
numbers,
In any environment maintaining data for the payroll application,
it's common practiceto store
only the basic pay in the file, This is acceptable because normally
the other elements of paycanbe
Advanced Filter
awk-—An 391
this basic pay. For the purpose of using awk,
d from the sixth field will now signify the basic
derive
ratherthan the salary. Assume that the total salary consists of basic pay, dearness allowance
u»usc rent (@15 % of basic pay).
Youcan use awkto print a rudimentary

"director" {
printf "%-20s %-12s empn.lst
lalit chowdury
sengupta di rector
8200
7800
3280
3120
1230
barun 1170
di rector
jai shama
singhvi di rector
7000
6700
2800
2680
1050
chanchal 1005

21.6VARIABLES
whileawkhas certain built-in variables, like NRand $0,representing the line number and the entire
line,respectively,it also permits the user to use variables of her choice. A user-defined variable used
byawkhas two special features:
No typedeclarations are needed.
Bydefault,variables are initialized to zero or a null string, depending on its type. awkhas a
mechanism of identifying the type of variable used from its context.
cannow use these properties to print a serial number, using the variable kount, and apply it to
You
selectthose directors drawing a salary exceeding 6700:
$ awk-F" I " '$3 director" &&$6 > 6700 {
> kount= kount + 1
> printf "%3d %-20s %-12s kount,$2,$3,$6 } ' empn.lst
1 lalit chowdury di rector 8200
2 barun sengupta di rector 7800
3 jai sharma di rector 7000
Theinitialvalue of kount was 0 (by default). That's why the first line is correctly assigned the
numberl. awkalso accepts the C-style incrementing forms:
kount++ Same as kount - kount + 1
kount+= 2 Same as kount = kount + 2
printf ++kount Increments kount beforeprinting
Tip NO type declarations or initial values are required for user-defined variables used in an
awkprogram. awk identifies their type and initializes them to zero or null strings.

21.7THE -f OPTION: STORING awkPROGRAMS IN A FILE


oushouldhold large awkprograms in separate files and provide them with the . awkextension for
easieridentification.Let's first store the previous program in the file empawk.awk:
$ cat empawk.awk
. director" $6 > 6700 {
pmntf "%3d%-20s %-12s %d\n% ++kount,$2,$3,$6 }
392 UNIX:Concepts and Applications
thc awkprogram. Youcan now
Observe that this timc wc haven't any quotc« to enclose
output:
with the -f filena'ne option to obtain thc satnc
awk -F I -f empawk.awkempn.lst
in the file is not enclosed
the -f option, make sure the program stored
If you use awkwith is specified in the command
Note
quotes. awk uses single quotes only when the program
Within
a shell script.
line or the entire awk command line is held in

21.8 THE BEGINAND ENDSECTIONS


by the address, and if there are no address
awkstatements are usually applicd to all lincs sclcctcd to print something before e,
processing
if you have
then they are applicd to every linc of input. But,
section can be used quite gainfully.Similarly
the first line, for example, a heading, then thc BEGIN
processing is over,
the ENDsection is useful in printing some totals after the
The BEGIN and END sections are optional and take the form
)
Both require curly braces
BEGIN { action
END { action )
awkprogram. You can usethem
These two sections, when present, are delimited by the body of the
end. Store this awkprogram,
to print a suitable heading at the beginning and the average salary at the
in a separate file empawk2. awk(Fig. 21.1).

BEGIN {
printf oyee
) $6 > 7500 { # Increment the variables for the serial numberand the pay
kount++ ; tot+= $6 # Multiple assignments in one line
printf 3d %-20s %-12s

END {
printf averagebasic pay is tot/kount

Fig. 21.1 empawk2.awk


section here is meant to print
Like the shell, awkalso uses the # for providing comments. The BEGIN
a suitable heading, offset by two tabs (\t\t), while the END section should print the averagepay
(tot/kount) for those lines selected. To execute this program you have to use the -f option:
$ awk -F" l" -f empawk2.awk empn.lst
Employeeabstract
I v, k. agrawal 9000
2 j 8b. saxena gem, 8000
3 lalit chowdury di rector 8200
4 barun sengupta di rector 7800
The average basic pay is 8250
awk—-AnAdvanced Filter 93
Always start the opening brace in the same
line the section (BEGIN or END)begins. If you
don't do that, awk will generate some strange messages!

21.9BUILT-IN VARIABLES
hasseveralbuilt-in variables (Table 21.2). They are all assigned automatically, though it is also
iblefor a user to reassign some of them. You have already used NR,
which signifies the record
numberof the current line. We'll now have a brief look at some of the other variables.

Table21.2Built-in Variables Used by awk


Variable Function
Cumulative number of lines read
Input field separator
OFS Output field separator
Number of fields in current line
FILENAME Current input file
ARGC Number of arguments in command line
ARGV List of arguments

TheFSVariable As stated elsewhere, awkuses a contiguous string of spaces as the default field
delimiter.FSredefines this field separator, which in the sample database happens to be the l. When
usedat all, it must occur in the BEGINsection so that the body of the program knows its value before
it starts processing:

BEGIN{ FS l"
Thisis an alternative to the -F option of the command which does the same thing.
TheOFSVariable When you used the print statement with comma-separated arguments, each
argumentwas separated from the other by a space. This is awk'sdefault output field separator, and
canbereassigned using the variable OFSin the BEGINsection:

BEGIN { }

Whenyoureassign this variable with a (tilde), awkwill use this character for delimiting the pri nt
arguments.This is a useful variable for creating lines with delimited fields.
TheNFVariable NFcomes in quite handy in cleaning up a database of lines that don't contain the
nghtnumber of fields. By using it on a file, say empx.1st, you can locate those lines not having six
and which have crept in due to faulty data entry;
fields,
$ awk BEGIN { FS
> print "Record No " NR "has " , NF, " fields"} ' empx.lst
RecordNo 6 has 4 fields
RecordNo 17 has 5 fields
Applications
394 UNIX:Concepts and
the name ofthc current filc being processed
ftLENAME storcs command line. By default,
filcnamcs in thc Like
handle multiple
and sed, awkcan also
it to do so:
the filename, but you can instruct
'
print FILENAME, $0 )
4000 ( depending on the file
that docs different things being
With FILENAME,you can devise logic

21.10 ARRAYS be virtually anything;


arrays. The index for an array can it
awkhandles one-dimensional an array is considered to be declared the
necessary;
a st,ing. No array declarations are zero, unless initialized explicitly.
used, and is automatically initialized to
21.2), we use arrays to store the totals of the basic pay,da hr
In the program empawk3•awk(Fig. Assume that the da is 25%, and hra 50%ofb' • arid
marketing people.
gross pay of the sales and element of pay, and also the gross pay:
store the totals of each
Use the tot( array to

BEGIN {

printf "Basic Hra Gross"


) /sales Imarketing/ { gross pay
# Calculate the da, hra and the
• - $6+hra+da
da = 0.25*$6 ; hra - 0.50*$6, gp
# Store the aggregates in separate arrays
tot[l] += $6 ; tot [2] += da ; tot [3] += hra , tot[4] gp
kount++

END{ # Print the averages


printf "\t Average
tot [l]/kount, tot [2] /kount, tot [3] /kount, tot [4] /kount

.awk
Fig. 21.2 empawk3
Note that this time we didn't match the patterns sales and marketi ng specificallyin a field. W
could afford to do that because the patterns occur only in the fourth field, and there's no scope her
for ambiguity. When you run the program, it outputs the averages of the two elements ofpay:
$ awk -f empawk3
.awk empn.lst
Basic Hra Gross
Average 6812 1703 3406 11921
The program is too simple to require any explanation. C programmers will find the syntax qui
comfortable to work with, except that awksimplifies a number of things that requireexplii
specificationin C. There are no type declarations, no initializations and no statementterminato
awk—AnAdvanced Filter 395

Il FUNCTIONS
functions, performing both arithmetic and string operations (Table 21.3).
built-in
delimited by commas and enclosed by a matched
arepassedto a function in c-style,
In contrast to C, however, when a function is used without arguments, the ( )
not be used.
bolsneed
functionstake a variable number of arguments, and one (1ength) uses no argument
of-these
form.The functions are adequately explained here so you can confidently use them in
Savariant
uses identical syntaxes.
perlwhichoften
aretwoarithmeticfunctions which a programmer will expect awkto offer. i nt calculates the
There
portionof a number (without rounding off), while sqrt calculates the square root of a
integral
awkalso has some of the common string handling functions you can hope to find in any
number.
We'll review them briefly.
language.
length I ength determines the length of its argument, and if no argument is present, then it
theentireline as its argument. You can use length (without any argument) to locate lines
assumes
length exceeds 1024 characters:
whose
'k F"l" 'length > 10241 empn.lst
canuselength with a field as well. The following program selects those people who have short
You
names:

awk-F"l " 'length($2) < Il l ernpn.lst


index index(sl ,s2) determines the position of a strings2 within a larger stringsl. This function
isespecially
useful in validating single character fields. If you have a field which can take the values
i,b,c,dor e, you can use this function to find out whether this single character field can be located
withinthe string abcde:

Thisreturns the value


2.
substr The ,n) function extracts a substring from a string stg. m represents
thestartingpoint of extraction, and n indicates the number of characters to be extracted. Because
valuescan also be used for computation, the returned string from this function can be used to
string
selectthose born
between 1946 and 1951:
$awk < 52' empn.lst
2365barun > 45
sengupta 111/05/47 1780012365
3564sudhir Agarwal di rectorlpersonnel
executivelpersonnel 106/07/471750012365
4290jayant Choudhury I executivelproduction107/09/501600019876
98761jai sharmaldi rectorlproductionl 12/03/501700019876
Youcanneverget this either sed or grep because their regular expressionscan never
output with identifying
matchthe numbers between 46 and 51. Note that awkdoes indeed possess a mechanism of
the typeOfexpression It identified the date field as a string for using substr and
th from its context.
enConvertedit to
a number for making a numeric comparison.
39 UNIX:Concepts and Applications
numeric and string variables. You can use a stringfor
awkmakes no distinction between
numeric computations as well.
delimiter ch and stores the fields in an
split breaks up a string stg on the
to the format YYYYMMDD: array
arr[] . Here's how you can convert the date field empn.lst
$ awk -F\ I ; print
19521212
19501203
19431904

because it explicitly picks up the fifthfield,


Youcan also do this with sed, but this method is superior
whereas sed would transform the only date field that it finds.
beginning of the report. Forrunning
system Youmay want to print the system date at the
function. Here are two examples:
any UNIX command within awk,you'll have to use the system
BEGIN {
system("tputclear") Clears the screen
Executes the UNIX date command

You should be familiar with all the functions discussed in this section as they are used in a wide
variety of situations. We'll use them again in perl . awkfeatures some more built-in variablesand
functions, and also allows the user to define her own functions.
Table 21.3 Built-in Functions in awk
Function Description
int (x) Returns integer value ofx
sqrt (x) Returns square root ofx
1ength Returns length of complete line
1ength (x) Returns length ofx
substr(stg,m ,n) Returns portion of string of length n, starting from position m in stringstg
i ndex(sl ,s2) Returns position of string s2 in string sl
spl ch) Splits string stg into array arr using ch as delimiter
system ("cmd i' ) Runs UNIX command cmd and returns its exit status

21.12 CONTROL FLOW—THEif STATEMENT


awkhas practically all the features of a modern programming language. It has conditional structUreS
(the if statement) and loops (while and for). They all execute a body of statements depending0n
the success or failure of the control command. This is simply a condition that is specified in thefirst
line of the construct. Since you have already been introduced to similar structures in the shell'they
will be described briefly in these sections.
The if statement can be used when the &&and I I are found to be inadequate for certaintasks•Its
The
behavior is well known to all programmers, and has also been elaborated in Section16•6•
statement here takes the form
awk—An Advanced Filter 97
conditionis true ) {
f(
statements

statements el se is optional

C none ofthe control flow constructs


Likein , require to use curly braces
beexecuted. But when there are multiple actions to ifthere's only one statement
take, the statements must be enclosed within
pairOfcurlybraces. Moreover, the control command must be
enclosed in parentheses.
Mostofthe addresses that have been used so far reflect the
logic normally used in the if statement.
Ina previousexample, you had selected the lines where the
basic pay exceeded 7500,by using the
conditionas the selection criteria:

$6 > 7500 {
Analternativeform of this logic places the condition inside
the action component rather than the
criteria. But this form requires the i f statement:
selection
F"l" I { if ($6 > 7500) printf
ifcan be used with the comparison operators and the special symbols -vand !—to match a regular
expression. When used in combination with the logical operators I I and &&,awkprogramming
becomesquite easy and powerful. Some of the earlier pattern matching expressions are rephrased
inthefollowing, this time in the form used by i f:
if ( NR 3 NR 6)
if ( $3 "director" Il $3 "chairman" )
if ( $3 / Ag.m/ )
if ( $2 /[aA]gg?[ar]+wa1/ )
if ( $2 /[cC]ho[wu]dh?urylsa[xk]s?ena/ )
Toillustratethe use of the optional el se statement, let's assume that the dearness allowance is 25%
ofbasicpay when the latter is less than 600, and 1000 otherwise. The if—else structure that
implementsthis logic looks like this:
if ( $6 < 6000 )
da = 0.25*$6
else
da = 1000

Youcan even replace the above i f construct with a compact conditional structure:
$6 < 6000 ? da = 0.25*$6 da = 1000
Thisis the form that C and perl use to implement the logic of a simple i f—else
construct. The ?
and: act as separators of the two actions.
one statement to be executed, they must be bounded by a pair ofcurly
en Youhave more than
if the factors determining the hra and da are in turn dependent on
races(as in C). For example,
e basicpay itself, then you need to use terminators:
398 UNIX:Concepts and Applications

if ( $6 < 6000) {
hra = 0.50*$6
da = 0.25*$6
hra 0.40*$6
da 1000

used in awk, awkuses the ( and


Note There's no endi f or fi terminator for the i f statement
to define a block of code.

21.13 LOOPING WITH for


awksupports two loops—for and while. They both execute the loop body as long as the control
command returns a true value. for has two forms. The easier one resembles its C COUnterpartA
simple example illustrates the first form:
for ( ,
This form also consists of three components; the first component initializes the valueof K,the
second checks the condition with every iteration, while the third sets the increment usedforevery
iteration.
for is useful for centering text, and the following example uses awkwith echo in a pipelinetodo
that:
$ echo "
> Income statement\nfor\nthe monthof August, 2002\nDepartment: Sales" I
> awk for (k = 1 • k < (55 - length($O)) / 2 ;
> pcintf
> print $0 } '
Income statement
for
the monthof August, 2002
Department : Sales
The loop here uses the first printf statement to print the required number of spaces(pagewidth
assumed to be 55). The line is then printed with the second pri ntf statement, which fallsoutside
the loop. This is a useful routine which can be used to center some titles that normallyappearat
the beginning of any report.
The second form of the for loop doesn't have a parallel in any known programminglanguage
except perl. It has an unusual syntax:
for ( in array)
commands
This form uses an array where k is the subscript. The unusual aspect of this form is that the subscript
a
here is not restricted to integers, but can even be a string! This makes it very simpleto display
awk---An Advanced Filter 99

countoftheemployees,grouped according to dcsignation (the third field). Since designation happens


you can use $3 as the subscript
bethethird field, of the array kount[ J:
to
for ( desig in kount)
> END{ desig,
print kount[desigl ) i empn.lst
chaiman 1
executive 2
director 4
manager 2
d.g.m.
Theprogramhere analyzes the database to reveal the break-up of employees, grouped on the basis
The array kount[] takes as its subscript nonnumeric values like g.m., chai rman,
d designation.
etc. for is invoked in the ENDsection to print the subscript (desig) and the number of
executive,
occurrences
of the subscript (kount [desi g] ). Note that you don't need to sort the input file to print
thisreport!

Note The same SQL-style logic has already been implemented by using three commands in a
pipeline—cut, sort and uniq (14.8.1). That one used only a single line of code!

21.14LOOPINGWITH while
Thewhileloop has a similar role to play; it repeatedly iterates the loop till the control command
For example, the previous for loop used for centering text can be easily replaced with a
succeeds.
whileconstruct:
kfirst initialized
while (k < (55 - length($0))/2) {
printf

print $0
e loophere prints a space and increments the value of k with every iteration. The condition
(k< (55- length ($0) )/2) is tested at the beginning ofevery iteration, and the loop body performed
)nlyifthetest succeeds. In this way, the entire line is filled with a string of spaces before the actual
extisprinted with print $0.

thatthe 1ength function has been used with an argument ($0). This awkunderstands to be the
mireline.Since length, in the absence of an argument, uses the entire line anyway,$0 can be
mitted.Similarly,print $0 may also be replaced by simply print.
1.15 CONCLUSION
' likesed, violates the do-one-thing-well philosophy that generally characterizesall UNIX
Althoughpresented in this chapter as a utility filter, it's more of a scripting language. At the
ofitsentry,you didn't have regular expressionsin other languages. Youcouldn't intermingle
400 UNIX:Concepts and Applications
declarations and initializations,
strings with numbers. Partly because of the abscncc of type an
program is often a fraction of the size of its C counterpart.
awkhas been complctcly overwhclmcd in shccr power by perl—the
latest and most notable
addition
to the UNIX tool kit for severalyears. There is not hing that any UNIX filter can do andwhich
is faster, and in every sense better than them.
perl can't. In fact, perl is even more compact,
many of the constructs
chapter was prepared for you to understand perl better because so arealso
used there. perl is taken up in Chapter 22.
WRAP UP
The awk filter combines features of several filters. awk can manipulate individual fields ($1, $2,etc.
in a line ($0). It uses sed-type addresses and the built-in variable NR to determine line number
Lines are printed with print and printf. The 'latteruses format specifiers to format strings(%s),
integers (%d)and floating point numbers (%f). Each pri nt or pri ntf statementcan be used
shell's operators for redirectionand piping.
awkuses all the comparison operators (like ==, etc.). The special operators and •!—areu
to match regular expressions or negate a match with specific fields. The and $ are usedto anch
a pattern at the beginning or end of a field rather than the line.
awk can perform numeric computation. It overcomes a shell limitation by handling decimal numbers
Variables are used without initializing them or declaring their type. awk accepts x++ as a way
incrementing variables.
awk can take instructionsfrom an externalfile ("f). In this case, the program must not be enclose
within quotes. The BEGINand ENDsections are used to do some pre- and post-processingwork.
report header is generated by the BEGIN section, and a numeric total -is computed in theEN
section.
awk's built-in variables can be used to specify the field delimiter (FS), the number of fields (NF)an
the filename (FILENAME).awk uses one-dimensional arrays where the array subscript can be
string as well.
awkhas a number of built-in functions, and many of them are used for string handling. Youcanfin
the length (1ength), extracta substring (substr) and find the location (index) of a stringwithin
larger string. The system function executesa UNIX command.
The if statementuses the return value of its control command to determine program flow.if als
uses the operators I I and &&to handle complex conditions. The first form of the for loop usesa
array and can be used to count occurrences of an item using a nonnumeric subscript.The Oth er
form resembles its C counterpart.The while loop repeats a set of instructions as long as itscontro
command returns a true value.
perl is betterthan awk.

TEST YOUR UNDERSTANDING


21.1 What is the difference etweenprint and print $0?
21.2 What is wrong with this statement?pri ntf "%s %-20s\n", $1, $6 | sort
21.3 Print only the odd-numbered lines of a file.
awk—AnAdvanced Filter 401

all blank lines (including those that


Howdo you delete contain whitespace) from a file?
lines longer than 100 and smaller than 150
Useawkto locate characters.
1.5
findoutrecursivelythe files in your home directory that have been last modified on January
current year at the 1 Ith hour.
6 of the
findoutthe total space usage of ordinary files in the current directory.
Usea for loop to center the output of the command echo "DOCUMENT LIST", where the page
widthis 55 characters.
Repeatthe previous exercise with a while loop.

y011RBRAIN
Displayfrom /etc/passwd a list of users and their shells for those using the Korn shell or
Bash.Order the output by the shell used.
1.2 find outthe next available IJID in /etc/passwd after ignoring all system users placed at the
beginningand up to the occurrence of the user nobody.
13 Thetar command on one system can't accept absolute pathnames longer than 100 characters.
Howcan you use find and awk to generate a list of such files?
1.4/Invertthe name of the individual in emp.1st (14. l) so that the last name occurs first.
1.5} Usea scriptto kill a process by specifying its name rather than the PID.
1.6]Howcan you print the last field of a line even if all lines don't contain the same number of
fields?Assume the field delimiter is the :.
1.7 Listthe users currently using the system along with a count of the number of times they have
logged in.
1.8 Writean awk sequence in a shell script which accepts input from the standard input. The
programshould print the total of any column specified as script argument. For instance,
-u progJ I awk_prog 3 should print the total of the third column in the output of progl.

You might also like