Awk Micro-Primer

Anybody deeply interested in shell scripting will at some point come across awk [1]. In the spirit of do one thing and do it well, awk can be defined as a line-oriented processor with state. In contrast, sed is a line-oriented with rules that provide only a stateless form.

Parsing columns

While it possesses an extensive set of operators and capabilities, we will cover only a few of these here - the ones most useful in shell scripts.

Awk breaks each line of input passed to it into fields. By default, a field is a string of consecutive characters delimited by whitespace, though there are options for changing this. Awk parses and operates on each separate field. This makes it ideal for handling structured text files -- especially tables -- data organized into consistent chunks, such as rows and columns.

Strong quoting and curly brackets enclose blocks of awk code within a shell script.

# $1 is field #1, $2 is field #2, etc.
$ echo one two | awk '{print $1}'
one
$ echo one two | awk '{print $2}'
two
# But what is field #0 ($0)?
$ echo one two | awk '{print $0}'
one two
# All the fields!

Arithmetic within columns

Awk is a full-featured text processing language with a syntax reminiscent of C. The parsing of columns can be combined with arithmetic operations and the use of variables.

In this example, we have downloaded the dividends of Starbucks for the last 2 years. With the help of awk we can compute how much we'll earn per share by summing all of the 8 dividend values.

Notice that because the downloaded data is in '.csv' format, we configure awk to use "," as a field separator

# Dividens of Starbucks $ cat SBUX.csv Date,Dividends 2022-08-11,0.490000 2022-11-09,0.530000 2023-02-09,0.530000 2023-05-11,0.530000 2023-08-10,0.530000 2023-11-09,0.570000 2024-02-08,0.570000 2024-05-16,0.570000

# Sum the column of the dividends, # print it at the end $ cat SBUX.csv | awk -F',' '{sum+=$2} END {print sum}' 4.32

The new element here is the command block END. Unlike the command block that accumulates the sum, which is executed once per each line of the input data file, END is run once at the end, we use it to print the final accumulated value.

Rendering Markdown (.md) files in the terminal

Markdown is a technique of text decoration for the semantic structures of a page, such as titles, lists, code sections, etc. For example, chapter titles are marked with #, the number of # representing nest level. A multi-line section of code starts and ends with ```, etc.

The format was made particularly popular after its adoption by GitHub for rendering of README.md files.

While the rendering of .md files is usually in the HTML format, in the context of the terminal it would be nice to be able to read such files with coloring using ANSI escape sequences.

A markdown decoration can start on one line and continue on the next, or across multiple-lines, using its stream-processing capabilities, combined with the ability to manage state, makes awk suitable for the task.

#!/usr/bin/awk -f
# md-color.awk
#
# Markdown highlighting for ANSI terminal
#
# NOTE: Processes only a subset of Markdown as a demo of AWK
#
# [2024-06-05]
# Written by Peter M., this is public domain
# (see: https://unlicense.org/)
#
# Usage (on terminal)
# $ md-color.awk demo.md
#
# Usage (via pager)
# $ md-color.awk demo.md | less -r
#
# Colors can be configured via env. variable MD_COLORS
#
function print_wrapped(long_line, prefix) {
  # Split words into an array
  split(long_line, list_of_words)
  line_len = 0
  num_words = length(list_of_words)
  for(i = 1; i <= num_words; i++) {
    word = list_of_words[i]
    word_len = length(word)
    # ~~~ Before word is printed ~~~
    # Check if we go beyond width of the block
    if ((line_len + word_len) > 78) {
      # Wrap the line
      line_len = 0
      printf("\n")
    }
    # At start of the line (line_len is 0 before we print first word)
    if (line_len == 0) {
      # Start with line prefix
      printf(prefix)
      # If inside multi-line bold
      if (bold_section)
        printf(ANSI_BOLD)
      # If inside multi-line inline code
      if (codeinline_section)
        printf(ANSI_CODE)
    }
    # Word begins with "**", set to bold
    if (match(word, "[*]{2}[^ ].*")) {
      printf(ANSI_BOLD)
      bold_section = 1
    }
    # Word begin with "`", set to code
    if (match(word, "`[^ ].*")) {
      printf(ANSI_CODE)
      codeinline_section = 1
    }
    # Advance length by one word
    line_len += word_len + 1
    # Special processing of punctuation at end of words
    # Example: "*word*," or "`word`."
    punct = substr(word, word_len)
    if ((punct == ".") || (punct == ",") || (punct == ":")) {
      # Separate punctuation from the word
      word = substr(word, 0, word_len - 1)
      word_len = word_len - 1
    }
    else
      punct = ""
    # ~~~ Print a word ~~~
    printf("%s", word)
    # ~~~ After the word is printed ~~~
    # Closing of bold
    if (substr(word, word_len - 1) == "**") {
      printf("%s", ANSI_NO_COLOR)
      bold_section = 0
    }
    # Closing of inline code
    if (substr(word, word_len) == "`") {
      printf("%s", ANSI_NO_COLOR)
      codeinline_section = 0
    }
    # Print the punctuation + a space
    printf("%s ", punct)
  }
  # Handle case of empty lines
  if (length(long_line) == 0)
    printf(prefix)
  # The line ends with a new-line
  printf("\n")
}
BEGIN {
  code_section = 0
  codeinline_section = 0
  bold_section = 0
  # Docs: ANSI Escape sequences at
  # https://gist.github.com/fnky/458719343aabd01cfb17a3a4f7296797
  # Use environment variable to set colors, or use defaults
  if (length(ENVIRON["MD_COLORS"]) > 0)
    color_set = ENVIRON["MD_COLORS"]
  else
    color_set = "*e[0;34m *e[0;32m *e[0m*e[1m"
  # Subsusute "*e" with actual ESC character
  gsub(/\*e[[]/, "\033[", color_set)
  # Split into an array
  split(color_set, md_colors)
  ANSI_NO_COLOR = "\033[0m"
  ANSI_BOLD = "\033[1m"
  ANSI_QUOTE = md_colors[1]
  ANSI_CODE = md_colors[2]
  ANSI_HEADING = md_colors[3]
}
# Quote, single line
/^[[:blank:]]*>/ {
  if (code_section == 0) {
    # Strip leading "> "
    line = substr($0, 2)
    # Print folded block + setting color + using "> " as a line prefix
    print_wrapped(line, ANSI_QUOTE "> ")
    # Switch off color at the end of the quote block
    printf("%s", ANSI_NO_COLOR)
    next
  }
}
# Code section, multi-line
/^[[:blank:]]*```/ {
  if (code_section == 0)
      # Activate color for code section
      printf("%s%s\n", ANSI_CODE, $0)
  else
      # Remove color at the end of the code section
      printf("%s%s%s\n", ANSI_CODE, $0, ANSI_NO_COLOR)
  # Toggle flag
  code_section = 1 - code_section
  next
}
# Heading, single line
/^#.*/ {
  if (code_section == 0) {
    printf("%s%s%s\n", ANSI_HEADING, $0, ANSI_NO_COLOR)
    next
  }
}
# Everything else, the text of the file that is not prefixed by any markup
{
  if (code_section == 0)
    # Outside code section, fold long lines, prefix = switch off the color
    print_wrapped($0, ANSI_NO_COLOR)
  else
    # Inside code section, print line unmodified (no folding), apply color for code
    printf("%s%s\n", ANSI_CODE, $0)
}

Download full source code from: https://git.sr.ht/~pem/dotfiles/tree/master/item/scripts/md-color.awk

Resources

There hasn't been much new development of awk in the last couple of decades. Recent enhancements of note are the ability to process Unicode characters and an option for native correct parsing of .csv files.

There are 3 major implementations. The original Unix awk, GNU's gawk, and mawk. In Debian (and by extension Ubuntu), the default is mawk while gawk is available for an easy install.

The LWN has a good article of introduction and history of awk: https://lwn.net/Articles/820829/

Home page of the regulars from the #awk channel on irc.libera.chat: http://awk.freeshell.org/HomePage

A focal point of updates on AWK development is https://github.com/freznicek/awesome-awk

[1]Its name derives from the initials of its authors, Aho, Weinberg, and Kernighan.