16 October 2020

Working with Unix Pipes

Microservices in action

Tags: Bash

Pipes are one of the cornerstones of software development in Unix and can be seen as a precursor to the central principles of microservices. A practical example from the field of web development.

The term microservices has been an indispensable part of the vocabulary of software architecture for years. Basically, it is about building applications in a modular way, so that a small, clearly defined and interchangeable program is used for every sub-task, no matter how modest.

The use of universal and software-independent interfaces is decisive for the interaction of these programs. One such interface is the pipe developed for Unix, which connects two processes with one another in such a way that the output stream of one process can be further processed as the input stream of another process.

The philosophy behind it was described by the inventor of pipes, Doug McIlroy, as early as the 1960s, using the analogy of connecting garden hoses:

We should have some ways of coupling programs like garden hose--screw in another segment when it becomes necessary to massage data in another way.

A pipe is written as the character |. The program to the left of | provides the output, and the program to the right of | reads that output as its input:

program1 | program2

This principle can be extended to any number of programs:

program1 | program2 | program3 | program4 | program5 etc.
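As a minimal sketch of such a chain, the following lines connect three standard programs. The word list is invented sample data: printf writes three lines, sort orders them, and uniq drops the repeated line.

```shell
# printf writes three lines, sort orders them, uniq removes the adjacent duplicate.
# The words are made-up sample data, not part of the article's example.
fruits=$(printf 'pear\napple\npear\n' | sort | uniq)
printf '%s\n' "$fruits"
```

None of the three programs knows anything about the others; each one only reads its input stream and writes its output stream.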

Example

The problem

The content management system Kirby, written in PHP, provides a function t() to display texts in different languages. The general form of t() is:

t(string|array $key, string $fallback = null): mixed

The function uses the key $key to look up a text in a language file. One such file has to be created for each desired language; it either contains a PHP array with the translations or reads the translations in YAML format from a separate file (for details see Flexible language variables).

An example of translations in a PHP array (here for the German language) could look like this:

'translations' => [
    'change' => 'Ändern',
    'confirm' => 'OK',
    'copy' => 'Kopieren',
    'create' => 'Erstellen'
]

And the same translations in YAML format:

change: Ändern
confirm: OK
copy: Kopieren
create: Erstellen

If $key is not found, the value of the optional parameter $fallback is used instead.

In Kirby's PHP code, calls to the function could look like this:

...
<?= t('copy','Copy') ?>
...
<?= t("create") ?>
...

Both variants are permitted. If the target language is German, "Kopieren" is displayed in the first case. If the copy key does not exist in the German language file, "Copy" is displayed instead. In the second case, either "Erstellen" or nothing at all is displayed, because without a translation and without a fallback there is nothing to output.

Before a website created with Kirby goes live, you have to make sure that all keys of all t() functions used have been recorded and translated.

Solution

First of all, it is necessary to find all files in which the function t() could be used. In Kirby, these files can be so-called templates, snippets or controllers, which are stored in the corresponding subdirectories of the site directory.

The find program can be used for this:

$ find controllers snippets templates -type f

Used this way, the site directory must be the working directory at the time of execution. find finds all entries recursively in the specified directories and restricts the output to files (-type f).

The output is a list of filenames that - shortened - might look like this:

controllers/blog.php
snippets/blog/excerpt.php
snippets/blog/header.php
templates/blog-article.php
templates/blog.php
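The behaviour of find can be tried out without a real Kirby installation. The sketch below builds a throwaway directory tree with mktemp (the file names are hypothetical stand-ins) and shows that -type f lists the files but not the directory snippets/blog itself:

```shell
# Throwaway stand-in for a Kirby "site" directory; all names are hypothetical.
site=$(mktemp -d)
mkdir -p "$site/controllers" "$site/snippets/blog" "$site/templates"
touch "$site/controllers/blog.php" \
      "$site/snippets/blog/header.php" \
      "$site/templates/blog.php"

# Run find from inside the site directory, as described above.
# sort only makes the order of the listing predictable.
files=$(cd "$site" && find controllers snippets templates -type f | LC_ALL=C sort)
printf '%s\n' "$files"

rm -rf "$site"
```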

Now it must be checked in which of these files the function t() is actually used.

The grep program can be used for this. grep searches line by line for the occurrence of a pattern in a file or in the input stream. However, simply piping the output of find into grep would make grep look for the pattern in the path and file names themselves, whereas it is supposed to search within the files.

You could process the output of find with a loop, but it is easier to pass it to the xargs program. xargs reads elements from its input stream and appends them as arguments to another program, thereby building and executing new commands.

To accomplish this, the file list output by find is passed on to xargs, which then runs grep with these files as arguments (as many per invocation as fit on one command line):

$ find controllers snippets templates -type f | xargs -r grep PATTERN

The -r parameter (a GNU extension, long form --no-run-if-empty) tells xargs not to run the command at all if the input stream contains no elements, for example because find did not find any files.
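How xargs builds its commands can be observed by letting it run echo instead of grep, so that the generated command line is printed rather than executed. This is only a sketch with invented file names, and it assumes an xargs that accepts -r (as GNU xargs does):

```shell
# echo stands in for the real command, so the generated command line becomes visible.
# a.php, b.php and c.php are invented names; no such files need to exist.
batched=$(printf 'a.php\nb.php\nc.php\n' | xargs -r echo grep PATTERN)
printf '%s\n' "$batched"

# With an empty input stream and -r, no command is run at all.
nothing=$(printf '' | xargs -r echo grep PATTERN)
printf 'output for empty input: [%s]\n' "$nothing"
```

The three names end up as arguments of one single command, which is why grep is not started once per file.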

However, the following must also be noted: if the file names supplied by find contain special characters such as spaces, xargs will not work correctly, because by default it interprets spaces (and newlines) as separators between the elements in the input stream.

Fortunately, find can be made to separate the individual elements of its output with a null character (NUL) instead of a line feed. To do this, append the -print0 option to the find command and use the additional -0 parameter to inform xargs that the elements in the input stream are separated from each other by a null character:

$ find controllers snippets templates -type f -print0 | xargs -0r grep PATTERN
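The effect of the null separator can be checked with a file whose name contains a space. The following sketch uses a temporary directory and a hypothetical file name; grep is called with -l (print only the names of matching files) and -F (treat the pattern as a fixed string) purely to keep the demonstration short:

```shell
# Temporary directory with one file whose name contains a space (hypothetical name).
dir=$(mktemp -d)
printf '%s\n' "<?= t('copy') ?>" > "$dir/my template.php"

# -print0 and -0 keep "my template.php" together as one element.
hits=$(find "$dir" -type f -print0 | xargs -0r grep -lF "t(")
printf '%s\n' "$hits"

rm -rf "$dir"
```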

What is missing is the pattern grep should look for in the files.

This pattern can be a literal string or a regular expression. The details of regular expressions are beyond the scope of this article; here and below, mainly so-called extended regular expressions are used.

grep is instructed to search for an extended regular expression with the -E option. When grep finds a string that matches this expression, it prints the entire line in which the string occurs by default. In addition, the line in question is only output once, even if the string occurs several times in the line.

Neither is desirable. Happily grep can be instructed via the optional -o parameter to output only the character string that corresponds to the pattern and to output all occurrences individually even if there are several findings within a line.

In addition, grep prefixes each match with the name of the file as soon as it searches more than one file. This output of the file name can be suppressed with the -h option.

The extended command is (from here on split across several input lines for reasons of space):

$ find controllers snippets templates -type f -print0 \
 | xargs -0r grep -Eoh "\bt\([^\)]+\)"

The regular expression in quotation marks matches all occurrences of t( that are preceded by a word boundary (\b) and followed by one or more characters other than ). A final ) concludes the pattern. The word boundary ensures that calls to t() are found, while functions whose names merely end in t, such as test(), are skipped.

The output of this command is a list of the t() function calls found:

t( 'change', 'Change' )
t("create","Create")
t( 'copy' )
t( 'change' , 'Change' )
t( "create", 'Build' )
t( 'copy', 'Copy' )

As you can see, it is possible that several function calls are found with the same key, which is not surprising, because the t() function can be used multiple times with the same key if the same text is to be displayed each time.

You can also see that in some function calls the parameters are specified with double quotes instead of single quotes and the use of spaces is inconsistent. One function call in this example does not contain an indication of a fallback.
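The interplay of -o and the word boundary can be tried on a single sample line instead of a whole site directory. The line below is invented for illustration; it contains two calls to t() and one call to a hypothetical function test(). Since grep reads from the input stream here, -h is not needed. The sketch assumes a grep that understands \b, as GNU grep does:

```shell
# One invented line of template code with two t() calls and one test() call.
sample="<p><?= t('copy') ?> <?= test('x') ?> <?= t(\"create\",\"Create\") ?></p>"

# -E: extended regular expression, -o: print every match on its own line.
calls=$(printf '%s\n' "$sample" | grep -Eo "\bt\([^\)]+\)")
printf '%s\n' "$calls"
```

Both t() calls are printed on separate lines although they share one input line, and test('x') does not appear.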

The next goal is to shape the output so that it can be processed either as a PHP array or as YAML.

In the following, therefore, the quotation marks are first standardized by replacing all occurrences of a double quotation mark " with a single quotation mark '.

The tr program, which is designed specifically for translating individual characters, can be used for this purpose:

$ find controllers snippets templates -type f -print0 \
 | xargs -0r grep -Eoh "\bt\([^\)]+\)" \
 | tr \" \'

The output is:

t( 'change', 'Change' )
t('create','Create')
t( 'copy' )
t( 'change' , 'Change' )
t( 'create', 'Build' )
t( 'copy', 'Copy' )
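In isolation, the tr step looks like this. The input line is taken from the list above; the variable name is only chosen for this sketch:

```shell
# tr replaces every occurrence of the first character (") with the second (').
# Both characters are escaped so that the shell passes them on unchanged.
normalized=$(printf '%s\n' 't("create","Create")' | tr \" \')
printf '%s\n' "$normalized"
```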

Another standardization makes sense, namely that function calls without a fallback are still written with a second parameter, which in this case consists of an empty string.

For example, the line t( 'copy' ) should be written as t( 'copy','').

This is a good job for the sed program, an editor that works on an input stream and can be used to perform many operations:

$ find controllers snippets templates -type f -print0 \
 | xargs -0r grep -Eoh "\bt\([^\)]+\)" \
 | tr \" \' \
 | sed -e "/,/! s/'\s*)/','')/"

The -e parameter passes a script to sed that manipulates the input stream. In this case the script consists of a single command with a condition in front of it: the address /,/ selects all lines that contain a comma, and the following ! negates this selection. The command s/'\s*)/','')/ is therefore only executed on lines without a comma.

This command replaces a character string that matches a regular expression with a text (in general: s/regex/replacement/). In this case, the pattern '\s*) matches a single quotation mark followed by a closing parenthesis, with any amount of whitespace in between (\s is a GNU extension). The replacement text ','') reinserts the single quotation mark and the closing parenthesis, but additionally places a comma and two single quotation marks between them.

The output changes to:

t( 'change', 'Change' )
t('create','Create')
t( 'copy','')
t( 'change' , 'Change' )
t( 'create', 'Build' )
t( 'copy', 'Copy' )
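The sed step can also be tested on its own with one call that lacks a fallback and one that has it. The function name add_fallback is introduced only for this sketch, which assumes GNU sed because of \s:

```shell
# Wraps the sed call from above; the name add_fallback is hypothetical.
add_fallback() {
  sed -e "/,/! s/'\s*)/','')/"
}

# The first line has no comma and is rewritten, the second is left untouched.
with_fallback=$(printf '%s\n' "t( 'copy' )" "t( 'change', 'Change' )" | add_fallback)
printf '%s\n' "$with_fallback"
```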

The translation functions found are now standardized so that the key is in the first string parameter and the fallback is in the second string parameter.

With another call of sed, the two strings can be isolated and output in any other way, e.g. as needed for use in a PHP array.

For this, sed must also be told to use extended regular expressions with -E. The back-references \1 and \2 reinsert the text matched by the first and second pair of parentheses of the search pattern, respectively:

$ find controllers snippets templates -type f -print0 \
 | xargs -0r grep -Eoh "\bt\([^\)]+\)" \
 | tr \" \' \
 | sed -e "/,/! s/'\s*)/','')/" \
 | sed -Ee "s/.*'(.*)'\s*,\s*'(.*)'.*/'\1' => '\2',/"

'change' => 'Change',
'create' => 'Create',
'copy' => '',
'change' => 'Change',
'create' => 'Build',
'copy' => 'Copy',

The output can be inserted directly into a PHP array.

If another output is desired, e.g. in the YAML format mentioned above, the replacement part in the last call to sed must be \1: \2 instead of '\1' => '\2',.
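The two output formats differ only in the replacement part of the sed command, which the following sketch shows side by side. The function names to_php and to_yaml are made up for this illustration, and GNU sed is assumed:

```shell
# Same search pattern, two different replacement parts (function names are hypothetical).
to_php()  { sed -Ee "s/.*'(.*)'\s*,\s*'(.*)'.*/'\1' => '\2',/"; }
to_yaml() { sed -Ee "s/.*'(.*)'\s*,\s*'(.*)'.*/\1: \2/"; }

# Two already standardized calls serve as input.
calls=$(printf '%s\n' "t( 'copy','')" "t('create','Create')")

php=$(printf '%s\n' "$calls" | to_php)
yaml=$(printf '%s\n' "$calls" | to_yaml)
printf '%s\n' "$php" "$yaml"
```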

It would be helpful to have the result sorted. With a large number of translations, problem cases (e.g. different fallbacks for the same key) would then end up next to each other and be more noticeable.

This is not a problem: the result only has to be passed through the sort program, shown here in YAML format:

$ find controllers snippets templates -type f -print0 \
 | xargs -0r grep -Eoh "\bt\([^\)]+\)" \
 | tr \" \' \
 | sed -e "/,/! s/'\s*)/','')/" \
 | sed -Ee "s/.*'(.*)'\s*,\s*'(.*)'.*/\1: \2/" \
 | sort

change: Change
change: Change
copy: 
copy: Copy
create: Build
create: Create

Sorting has a further benefit: identical lines now follow each other directly and can be removed. They represent identical calls to the t() function at different locations in the Kirby program code and only need to be listed once in the translation file.

This is possible either with the -u parameter of the sort program, or another pipe through the uniq program:

$ find controllers snippets templates -type f -print0 \
 | xargs -0r grep -Eoh "\bt\([^\)]+\)" \
 | tr \" \' \
 | sed -e "/,/! s/'\s*)/','')/" \
 | sed -Ee "s/.*'(.*)'\s*,\s*'(.*)'.*/\1: \2/" \
 | sort \
 | uniq

change: Change
copy: 
copy: Copy
create: Build
create: Create
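Once the duplicates are gone, the remaining problem cases can even be listed automatically. This goes one small step beyond the pipeline above and is meant as a sketch only: it assumes that keys contain no colon, uses cut to isolate the key, and uses uniq -d to print keys that still occur more than once, i.e. keys with differing fallbacks. LC_ALL=C merely makes the sort order independent of the locale.

```shell
# The YAML lines produced by the pipeline before sorting (copied from the example).
lines=$(printf '%s\n' 'change: Change' 'create: Create' 'copy: ' \
                      'change: Change' 'create: Build' 'copy: Copy')

# sort -u gives the same result as sort | uniq.
via_uniq=$(printf '%s\n' "$lines" | LC_ALL=C sort | uniq)
via_u=$(printf '%s\n' "$lines" | LC_ALL=C sort -u)

# Keep only the key (text before the first colon); uniq -d prints repeated keys once.
conflicts=$(printf '%s\n' "$via_u" | cut -d: -f1 | uniq -d)
printf '%s\n' "$conflicts"
```

For the sample data this reports copy and create, the two keys that are used with different fallbacks.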

Conclusion

By skilfully chaining together various small, already existing programs, it is possible to solve complex tasks without programming a new tool that would only serve this one purpose.

Each of the programs used solves only one particular sub-task. Should it turn out over time that another program solves a sub-task better, the stage in question can be replaced without difficulty.

The prerequisite, however, is the use of a universal interface, such as the Unix pipe in this case.
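The complete pipeline can be reproduced against a throwaway directory, which also makes it easy to try out the replacement of a single stage. The following is a sketch under several assumptions: GNU versions of grep, sed and xargs, a temporary directory created with mktemp in place of a real Kirby site, and hypothetical file names and a hypothetical function name extract. The last stage is run once as sort | uniq and once as sort -u.

```shell
# Throwaway stand-in for the site directory; file names and contents are hypothetical.
site=$(mktemp -d)
mkdir -p "$site/controllers" "$site/snippets" "$site/templates"
printf '%s\n' "<?= t( 'change', 'Change' ) ?>" '<?= t("create","Create") ?>' \
  > "$site/templates/blog.php"
printf '%s\n' "<?= t( 'copy' ) ?> <?= t( 'change' , 'Change' ) ?>" \
  > "$site/snippets/my header.php"
printf '%s\n' "<?= t( \"create\", 'Build' ) ?>" "<?= t( 'copy', 'Copy' ) ?>" \
  > "$site/controllers/blog.php"

# All stages of the article's pipeline up to, but not including, the sorting.
extract() {
  find controllers snippets templates -type f -print0 \
    | xargs -0r grep -Eoh "\bt\([^\)]+\)" \
    | tr \" \' \
    | sed -e "/,/! s/'\s*)/','')/" \
    | sed -Ee "s/.*'(.*)'\s*,\s*'(.*)'.*/\1: \2/"
}

# The final stage is interchangeable: sort | uniq and sort -u give the same result.
with_uniq=$(cd "$site" && extract | LC_ALL=C sort | uniq)
with_u=$(cd "$site" && extract | LC_ALL=C sort -u)
printf '%s\n' "$with_uniq"

rm -rf "$site"
```

Because every stage only reads text and writes text, swapping one of them does not require any change to the others.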