Parsing JSON in Forty Lines of Awk

·

JSON is not a friendly format to the Unix shell — it’s hierarchical, and cannot be reasonably split on any character (other than the newline, which is not very useful) as that character might be included in a string. There are well-known tools such as jq that let you correctly parse JSON documents in the shell, but all require an additional dependency. Another option is to use Python, which is ubiquitous enough that it can be expected to be installed on virtually every machine, and for new projects would be the recommended option.

However, I already had a working POSIX shell script that now had a requirement to read and parse JSON. It had previously extracted values from HTML which, while also being hierarchical, can be reliably split on certain characters (the angle brackets) for basic extraction of values. awk is the closest thing to a real programming language that’s available in the POSIX shell, so I thought I’d try to write a basic JSON parser in it. I had already written a full-blown one before, so I knew it was doable, but I needed something more concise.

First, there are some caveats. JSON is notoriously tricky to get completely right, despite its simple grammar. The following code assumes that it will be fed valid JSON. It has some basic validation as a function of the parsing and will most likely throw an error if it encounters something strange, but there are no guarantees beyond that. In my case, I’m reading JSON from a single, trusted source, so this is an acceptable constraint.

The interface is simple, a single function that accepts a JSON document and a dotted path to a key or array index, and returns the corresponding value. It can be used like so:

items = get_json_value(json, "payload.items")
while ((item = get_json_value(items, i++))) {
	type = decode_json_string(get_json_value(item, "type"))
	name = decode_json_string(get_json_value(item, "name"))
}

To keep things simple, the same function handles both arrays and objects. In JavaScript, arrays are roughly equivalent to objects with integer keys, and we use the same approach here. This is the implementation, expanded and annotated:

# The function takes two parameters, the JSON object/array and the desired key
# The rest are local variables (awk only allows local variables in the form
# of function parameters)
function get_json_value( \
	s, key,
	type, all, rest, isval, i, c, k \
) {
	# Trim leading whitespace, if any
	if (match(s, /^[[:space:]]+/)) s = substr(s, RLENGTH+1)

	# Get the type of value by its first character
	type = substr(s, 1, 1)

	# This variable is needed for when we recursively call the function
	# It will be true if the key argument is undefined, since such
	# variables can behave as either a string or a number in awk
	all = key == "" && key == 0

	# If this is a primitive
	if (type != "{" && type != "[") {
		# Ensure a key is not passed
		if (!all) error("invalid json array/object " s)
		# Parse the value
		if (!match(s, /^(null|true|false|"(\\.|[^\\"])*"|[.0-9Ee+-]+)/))
			error("invalid json value " s)
		# And return it
		return substr(s, 1, RLENGTH)
	}

	# Get the first part of the key (which we will be looking for)
	# if the path is dotted and save the rest for now
	if (!all && (i = index(key, "."))) {
		rest = substr(key, i+1)
		key = substr(key, 1, i-1)
	}

	# isval keeps track of whether we are looking at a JSON key or value
	# In an array, all items are values
	# k is the current key
	# If this is an array, it is the index, which starts at 0
	if ((isval = type == "[")) k = 0

	# Loop over the characters in the provided JSON
	# Skip the opening brace or bracket (to avoid infinite recursion) and
	# increment the index by the length of the token
	for (i = 2; i <= length(s); i += length(c)) {
		# Skip over whitespace
		if (match(substr(s, i), /^[[:space:]]+/)) {
			c = substr(s, i, RLENGTH)
			continue
		}

		# Temporarily assign the first character to our token variable
		c = substr(s, i, 1)

		# If it's a closing brace or bracket, we've reached the end of
		# the object or array, so exit the loop
		if (c == "}" || c == "]") break

		# If we find a comma in an object, the next item will be a key,
		# so reset isval. If it's an array, increment the index
		else if (c == ",") { if ((isval = type == "[")) ++k }

		# If we see a colon, the next token will be a value
		else if (c == ":") isval = 1

		# Otherwise, we expect a JSON value
		else {
			# If the key matches, this is our desired value,
			# so pass the rest of the key and return the result
			if (!all && k == key && isval)
				return get_json_value(substr(s, i), rest)
			# Otherwise, get the full value
			c = get_json_value(substr(s, i))
			# If this is a string and we're not expecting a value,
			# then it's a key, so trim the quotes and save it
			if (c ~ /^"/ && !isval) k = substr(c, 2, length(c)-2)
		}
	}

	# Do a basic check that the object or array was properly closed
	if ((type == "{" && c != "}") || (type == "[" && c != "]"))
		error("unterminated json array/object " s)

	# If we're here, it means we didn't find the value we're looking for
	# so only return something if the whole array or object was requested
	if (all) return substr(s, 1, i)
}

To make the parser more useful, you’ll also need a function to do some decoding of JSON strings. This is a simple one, which handles everything except Unicode escape sequences, but throws an error if it encounters one:

function decode_json_string(s, out, esc) {
	if (s !~ /^"./ || substr(s, length(s), 1) != "\"")
		error("invalid json string " s)

	s = substr(s, 2, length(s)-2)

	esc["b"] = "\b"; esc["f"] = "\f"; esc["n"] = "\n"; esc["\""] = "\""
	esc["r"] = "\r"; esc["t"] = "\t"; esc["/"] = "/" ; esc["\\"] = "\\"

	while (match(s, /\\/)) {
		if (!(substr(s, RSTART+1, 1) in esc))
			error("unknown json escape " substr(s, RSTART, 2))
		out = out substr(s, 1, RSTART-1) esc[substr(s, RSTART+1, 1)]
		s = substr(s, RSTART+2)
	}

	return out s
}

And finally, since there is no built-in error function in awk, you can use something like this:

function error(msg) {
	printf "%s: %s\n", ARGV[0], msg > "/dev/stderr"
	exit 1
}