summaryrefslogtreecommitdiffhomepage
path: root/awk.html.markdown
diff options
context:
space:
mode:
Diffstat (limited to 'awk.html.markdown')
-rw-r--r--awk.html.markdown386
1 files changed, 386 insertions, 0 deletions
diff --git a/awk.html.markdown b/awk.html.markdown
new file mode 100644
index 00000000..3ff3f937
--- /dev/null
+++ b/awk.html.markdown
@@ -0,0 +1,386 @@
+---
+category: tool
+tool: awk
+filename: learnawk.awk
+contributors:
+ - ["Marshall Mason", "http://github.com/marshallmason"]
+
+---
+
+AWK is a standard tool on every POSIX-compliant UNIX system. It's like
+flex/lex, from the command-line, perfect for text-processing tasks and
+other scripting needs. It has a C-like syntax, but without mandatory
+semicolons (although, you should use them anyway, because they are required
+when you're writing one-liners, something AWK excels at), manual memory
+management, or static typing. It excels at text processing. You can call to
+it from a shell script, or you can use it as a stand-alone scripting language.
+
+Why use AWK instead of Perl? Readability. AWK is easier to read
+than Perl. For simple text-processing scripts, particularly ones that read
+files line by line and split on delimiters, AWK is probably the right tool for
+the job.
+
+```awk
+#!/usr/bin/awk -f
+
+# Comments are like this
+
+
+# AWK programs consist of a collection of patterns and actions.
+pattern1 { action; } # just like lex
+pattern2 { action; }
+
+# There is an implied loop and AWK automatically reads and parses each
+# record of each file supplied. Each record is split by the FS delimiter,
+# which defaults to white-space (multiple spaces,tabs count as one)
+# You can assign FS either on the command line (-F C) or in your BEGIN
+# pattern
+
+# One of the special patterns is BEGIN. The BEGIN pattern is true
+# BEFORE any of the files are read. The END pattern is true after
+# an End-of-file from the last file (or standard-in if no files specified)
+# There is also an output field separator (OFS) that you can assign, which
+# defaults to a single space
+
+BEGIN {
+
+ # BEGIN will run at the beginning of the program. It's where you put all
+ # the preliminary set-up code, before you process any text files. If you
+ # have no text files, then think of BEGIN as the main entry point.
+
+ # Variables are global. Just set them or use them, no need to declare..
+ count = 0;
+
+ # Operators just like in C and friends
+ a = count + 1;
+ b = count - 1;
+ c = count * 1;
+ d = count / 1; # integer division
+ e = count % 1; # modulus
+ f = count ^ 1; # exponentiation
+
+ a += 1;
+ b -= 1;
+ c *= 1;
+ d /= 1;
+ e %= 1;
+ f ^= 1;
+
+ # Incrementing and decrementing by one
+ a++;
+ b--;
+
+ # As a prefix operator, it returns the incremented value
+ ++a;
+ --b;
+
+ # Notice, also, no punctuation such as semicolons to terminate statements
+
+ # Control statements
+ if (count == 0)
+ print "Starting with count of 0";
+ else
+ print "Huh?";
+
+ # Or you could use the ternary operator
+ print (count == 0) ? "Starting with count of 0" : "Huh?";
+
+ # Blocks consisting of multiple lines use braces
+ while (a < 10) {
+ print "String concatenation is done" " with a series" " of"
+ " space-separated strings";
+ print a;
+
+ a++;
+ }
+
+ for (i = 0; i < 10; i++)
+ print "Good ol' for loop";
+
+ # As for comparisons, they're the standards:
+ # a < b # Less than
+ # a <= b # Less than or equal
+ # a != b # Not equal
+ # a == b # Equal
+ # a > b # Greater than
+ # a >= b # Greater than or equal
+
+ # Logical operators as well
+ # a && b # AND
+ # a || b # OR
+
+ # In addition, there's the super useful regular expression match
+ if ("foo" ~ "^fo+$")
+ print "Fooey!";
+ if ("boo" !~ "^fo+$")
+ print "Boo!";
+
+ # Arrays
+ arr[0] = "foo";
+ arr[1] = "bar";
+
+ # You can also initialize an array with the built-in function split()
+
+ n = split("foo:bar:baz", arr, ":");
+
+ # You also have associative arrays (actually, they're all associative arrays)
+ assoc["foo"] = "bar";
+ assoc["bar"] = "baz";
+
+ # And multi-dimensional arrays, with some limitations I won't mention here
+ multidim[0,0] = "foo";
+ multidim[0,1] = "bar";
+ multidim[1,0] = "baz";
+ multidim[1,1] = "boo";
+
+ # You can test for array membership
+ if ("foo" in assoc)
+ print "Fooey!";
+
+ # You can also use the 'in' operator to traverse the keys of an array
+ for (key in assoc)
+ print assoc[key];
+
+ # The command line is in a special array called ARGV
+ for (argnum in ARGV)
+ print ARGV[argnum];
+
+ # You can remove elements of an array
+ # This is particularly useful to prevent AWK from assuming the arguments
+ # are files for it to process
+ delete ARGV[1];
+
+ # The number of command line arguments is in a variable called ARGC
+ print ARGC;
+
+ # AWK has several built-in functions. They fall into three categories. I'll
+ # demonstrate each of them in their own functions, defined later.
+
+ return_value = arithmetic_functions(a, b, c);
+ string_functions();
+ io_functions();
+}
+
+# Here's how you define a function
+function arithmetic_functions(a, b, c, d) {
+
+ # Probably the most annoying part of AWK is that there are no local
+ # variables. Everything is global. For short scripts, this is fine, even
+ # useful, but for longer scripts, this can be a problem.
+
+ # There is a work-around (ahem, hack). Function arguments are local to the
+ # function, and AWK allows you to define more function arguments than it
+ # needs. So just stick local variable in the function declaration, like I
+ # did above. As a convention, stick in some extra whitespace to distinguish
+ # between actual function parameters and local variables. In this example,
+ # a, b, and c are actual parameters, while d is merely a local variable.
+
+ # Now, to demonstrate the arithmetic functions
+
+ # Most AWK implementations have some standard trig functions
+ localvar = sin(a);
+ localvar = cos(a);
+ localvar = atan2(b, a); # arc tangent of b / a
+
+ # And logarithmic stuff
+ localvar = exp(a);
+ localvar = log(a);
+
+ # Square root
+ localvar = sqrt(a);
+
+ # Truncate floating point to integer
+ localvar = int(5.34); # localvar => 5
+
+ # Random numbers
+ srand(); # Supply a seed as an argument. By default, it uses the time of day
+ localvar = rand(); # Random number between 0 and 1.
+
+ # Here's how to return a value
+ return localvar;
+}
+
+function string_functions( localvar, arr) {
+
+ # AWK, being a string-processing language, has several string-related
+ # functions, many of which rely heavily on regular expressions.
+
+ # Search and replace, first instance (sub) or all instances (gsub)
+ # Both return number of matches replaced
+ localvar = "fooooobar";
+ sub("fo+", "Meet me at the ", localvar); # localvar => "Meet me at the bar"
+ gsub("e+", ".", localvar); # localvar => "m..t m. at th. bar"
+
+ # Search for a string that matches a regular expression
+ # index() does the same thing, but doesn't allow a regular expression
+ match(localvar, "t"); # => 4, since the 't' is the fourth character
+
+ # Split on a delimiter
+ n = split("foo-bar-baz", arr, "-"); # a[1] = "foo"; a[2] = "bar"; a[3] = "baz"; n = 3
+
+ # Other useful stuff
+ sprintf("%s %d %d %d", "Testing", 1, 2, 3); # => "Testing 1 2 3"
+ substr("foobar", 2, 3); # => "oob"
+ substr("foobar", 4); # => "bar"
+ length("foo"); # => 3
+ tolower("FOO"); # => "foo"
+ toupper("foo"); # => "FOO"
+}
+
+function io_functions( localvar) {
+
+ # You've already seen print
+ print "Hello world";
+
+ # There's also printf
+ printf("%s %d %d %d\n", "Testing", 1, 2, 3);
+
+ # AWK doesn't have file handles, per se. It will automatically open a file
+ # handle for you when you use something that needs one. The string you used
+ # for this can be treated as a file handle, for purposes of I/O. This makes
+ # it feel sort of like shell scripting, but to get the same output, the string
+ # must match exactly, so use a variable:
+
+ outfile = "/tmp/foobar.txt";
+
+ print "foobar" > outfile;
+
+ # Now the string outfile is a file handle. You can close it:
+ close(outfile);
+
+ # Here's how you run something in the shell
+ system("echo foobar"); # => prints foobar
+
+ # Reads a line from standard input and stores in localvar
+ getline localvar;
+
+ # Reads a line from a pipe (again, use a string so you close it properly)
+ cmd = "echo foobar";
+ cmd | getline localvar; # localvar => "foobar"
+ close(cmd);
+
+ # Reads a line from a file and stores in localvar
+ infile = "/tmp/foobar.txt";
+ getline localvar < infile;
+ close(infile);
+}
+
+# As I said at the beginning, AWK programs consist of a collection of patterns
+# and actions. You've already seen the BEGIN pattern. Other
+# patterns are used only if you're processing lines from files or standard
+# input.
+#
+# When you pass arguments to AWK, they are treated as file names to process.
+# It will process them all, in order. Think of it like an implicit for loop,
+# iterating over the lines in these files. these patterns and actions are like
+# switch statements inside the loop.
+
+/^fo+bar$/ {
+
+ # This action will execute for every line that matches the regular
+ # expression, /^fo+bar$/, and will be skipped for any line that fails to
+ # match it. Let's just print the line:
+
+ print;
+
+ # Whoa, no argument! That's because print has a default argument: $0.
+ # $0 is the name of the current line being processed. It is created
+ # automatically for you.
+
+ # You can probably guess there are other $ variables. Every line is
+ # implicitly split before every action is called, much like the shell
+ # does. And, like the shell, each field can be access with a dollar sign
+
+ # This will print the second and fourth fields in the line
+ print $2, $4;
+
+ # AWK automatically defines many other variables to help you inspect and
+ # process each line. The most important one is NF
+
+ # Prints the number of fields on this line
+ print NF;
+
+ # Print the last field on this line
+ print $NF;
+}
+
+# Every pattern is actually a true/false test. The regular expression in the
+# last pattern is also a true/false test, but part of it was hidden. If you
+# don't give it a string to test, it will assume $0, the line that it's
+# currently processing. Thus, the complete version of it is this:
+
+$0 ~ /^fo+bar$/ {
+ print "Equivalent to the last pattern";
+}
+
+a > 0 {
+ # This will execute once for each line, as long as a is positive
+}
+
+# You get the idea. Processing text files, reading in a line at a time, and
+# doing something with it, particularly splitting on a delimiter, is so common
+# in UNIX that AWK is a scripting language that does all of it for you, without
+# you needing to ask. All you have to do is write the patterns and actions
+# based on what you expect of the input, and what you want to do with it.
+
+# Here's a quick example of a simple script, the sort of thing AWK is perfect
+# for. It will read a name from standard input and then will print the average
+# age of everyone with that first name. Let's say you supply as an argument the
+# name of a this data file:
+#
+# Bob Jones 32
+# Jane Doe 22
+# Steve Stevens 83
+# Bob Smith 29
+# Bob Barker 72
+#
+# Here's the script:
+
+BEGIN {
+
+ # First, ask the user for the name
+ print "What name would you like the average age for?";
+
+ # Get a line from standard input, not from files on the command line
+ getline name < "/dev/stdin";
+}
+
+# Now, match every line whose first field is the given name
+$1 == name {
+
+ # Inside here, we have access to a number of useful variables, already
+ # pre-loaded for us:
+ # $0 is the entire line
+ # $3 is the third field, the age, which is what we're interested in here
+ # NF is the number of fields, which should be 3
+ # NR is the number of records (lines) seen so far
+ # FILENAME is the name of the file being processed
+ # FS is the field separator being used, which is " " here
+ # ...etc. There are plenty more, documented in the man page.
+
+ # Keep track of a running total and how many lines matched
+ sum += $3;
+ nlines++;
+}
+
+# Another special pattern is called END. It will run after processing all the
+# text files. Unlike BEGIN, it will only run if you've given it input to
+# process. It will run after all the files have been read and processed
+# according to the rules and actions you've provided. The purpose of it is
+# usually to output some kind of final report, or do something with the
+# aggregate of the data you've accumulated over the course of the script.
+
+END {
+ if (nlines)
+ print "The average age for " name " is " sum / nlines;
+}
+
+```
+Further Reading:
+
+* [Awk tutorial](http://www.grymoire.com/Unix/Awk.html)
+* [Awk man page](https://linux.die.net/man/1/awk)
+* [The GNU Awk User's Guide](https://www.gnu.org/software/gawk/manual/gawk.html) GNU Awk is found on most Linux systems.
+* [AWK one-liner collection](http://tuxgraphics.org/~guido/scripts/awk-one-liner.html)
+* [Awk alpinelinux wiki](https://wiki.alpinelinux.org/wiki/Awk) a technical summary and list of "gotchas" (places where different implementations may behave in different or unexpected ways).
+* [basic libraries for awk](https://github.com/dubiousjim/awkenough)