Friday, April 30, 2021

Taking a first stab at C++20 coroutines

Philosophers' Syndrome: mistaking a failure of the imagination for an insight into necessity.

--Daniel C. Dennett, Consciousness Explained

There is no doubt that the C++20 coroutine framework gives the impression of being a rather complex piece of machinery.

If you are just starting out with C++20 coroutines, I would suggest reading this blog together with a number of other blogs targeting C++20 coroutines, as well as the specification and the online cppreference pages.

Here are a few references to articles, papers and blogs that I have found useful:

Before getting started, I apologize that C++ syntax highlighting no longer seems to work in this blog. Tracking down the problem is too time-consuming right now; I will try to fix it later.

Writing C++20 coroutines can initially be a daunting task

Before getting started, it might be worth trying to understand why the C++20 coroutine machinery appears rather complicated, just to get an idea of what to watch out for:

  • The simplest patterns and examples are not supported out of the box by the framework. You have to write a bunch of filler code to get anything working
  • The framework has many moving parts and each part internally is constructed from chewing gum. In fact, it’s hard to even understand where the variability ends in the framework
  • The chewing gum aspect makes the framework incredibly flexible at the cost of making even simple examples non-trivial to implement
  • The names given to the various types in the framework are confusing
  • To understand what can and cannot be done you literally have to understand everything – understanding bits and pieces just won’t cut it when you want to implement something non-trivial

No ... the framework is not really that complicated ... it just takes a while to get used to ... that's all.

A simple generator based on C++20 coroutines

The problem: we want to write a generator that generates n numbers as a sequence: 0, 1, 2, ... n-1.

That's not really spectacularly hard to achieve using standard programming techniques. However, if the generator function maintains a more complicated state (where to continue in the function, temporary variables, etc.) it can become quite hairy.
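To make this concrete, here is a sketch of such a hand-written generator (all the names here are mine, invented for illustration; nothing framework related yet):

```cpp
#include <cassert>
#include <optional>

// A hand-written generator producing 0, 1, 2, ... n-1. Here the saved state is
// a single counter, but every extra local variable and every extra resume
// point in a more complicated generator would have to be lifted into this
// struct by hand.
struct IntGen{
  int i=0;
  int n;
  explicit IntGen(int n):n(n){}
  std::optional<int> next(){
    if(i>=n)return std::nullopt;
    return i++;
  }
};
```

This works fine for a single counter, but the manual state bookkeeping grows quickly with the complexity of the generator – and that bookkeeping is exactly what the coroutine machinery does for us.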

The solution: use C++20 coroutines to implement the generator.

The coroutine function

Below is the code for a simple generator, generating a sequence of integers as: 0, 1, 2, ... n-1:

#include <coroutine>
#include <iostream>
#include <optional>
using namespace std;


// filler code
…

template<typename T>
struct Gen{
  T value()const;
  optional<T>next();
};

Gen<int>genint(int n){
  for(int i=0;i<n;++i)co_yield i;
}

By filler code I mean customized boilerplate code that we have to write so that our code works with the C++20 coroutine machinery.

We are missing the definition of Gen in the code. But if we had it, we could have printed the sequence as:

int main(){ 
  auto gen=genint(5);
  while(auto val=gen.next()){
    cout<<val.value()<<endl;
  }
}

with the output:

0
1
2
3
4

Before writing the Gen class, let us take a look at the statement:

co_yield i;

Conceptually the statement says:

  • Save the current state of the coroutine genint
  • Save the value i somewhere so that the caller can retrieve it by calling value() (or next()) on the gen object
  • Suspend the coroutine and return to the caller

Exactly how we accomplish this we will see in the implementation of Gen. As you will see, this is where we enter the somewhat confusing and flexible machinery of the C++20 coroutines.

The coroutine function return object

The Gen class looks like:

// definition of val_promise_type
// (to be implemented later)
// …

// Gen implementation
template<typename T>
struct Gen{
  // some typedefs
  using value_t=T;
  using promise_type=val_promise_type<Gen>;

  // keep handle so we can get to promise
  coroutine_handle<promise_type>handle_;

  // ctor
  // (called from 'promise_type::get_return_object()')
  Gen(coroutine_handle<promise_type>handle):handle_(handle){}
  
  // dtor
  ~Gen(){handle_.destroy();}

  // get current value
  value_t value()const{
    return handle_.promise().value();
  }
  // get next item
  optional<value_t>next(){
    if(handle_.done())return std::nullopt;
    handle_.resume();
    return handle_.done()?optional<value_t>{}:value();
  }
};

To start with, the Gen class uses a type called promise_type. In fact, the C++20 framework requires (actually no … but for now let’s pretend it does) the Gen class to contain a type called promise_type. We’ll get to that later, but for now, let’s just assume we have that type defined.

Before continuing, notice that the class Gen<int> is the class the application code deals with when it wants data from a sequence. The code will not deal further with the coroutine function (genint) once it has gotten hold of the Gen<int> object.

Let’s take note of a few things in the code:

  • Gen implements a method value() to retrieve the last value that the coroutine co_yielded
  • Gen implements a method next() for stepping to the next value in the sequence

Notice that these are our methods – they are not part of the framework – so we could have named them anything we wanted. In fact, we could even have supplied an iterator interface inside the Gen class. In a more thought-through design, we probably would have implemented an iterator interface inside the generator.
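As a sketch of what such an iterator interface could look like, here is a minimal input-iterator style wrapper over any generator type exposing a next() that returns an optional. The names and the design are hypothetical – they are not part of the Gen class above or of the framework:

```cpp
#include <cassert>
#include <optional>
#include <vector>

// A minimal input-iterator style wrapper over anything exposing
// 'std::optional<value_t> next()'.
template<typename G>
struct gen_iter{
  G*gen=nullptr;                          // null in the end() sentinel
  std::optional<typename G::value_t>cur;

  gen_iter()=default;
  explicit gen_iter(G*g):gen(g){cur=gen->next();}

  typename G::value_t operator*()const{return *cur;}
  gen_iter&operator++(){cur=gen->next();return *this;}
  bool operator!=(gen_iter const&)const{return cur.has_value();}
};

// range wrapper so a generator can be used in a range-based for loop
template<typename G>
struct gen_range{
  G&gen;
  gen_iter<G> begin(){return gen_iter<G>(&gen);}
  gen_iter<G> end(){return {};}
};
```

With this in place, any generator with a suitable next() can be looped over with a range-based for statement.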

So far we have not touched much related to the C++20 coroutine framework. Now we will do just that:

  • Inside the value() method, we retrieve the yielded value from the promise object. When we implement the promise_type we will see how the value ends up there
  • Inside the next() method, we use something called a coroutine_handle to check if the coroutine is done. Then we use the same handle to resume the coroutine. Finally, we return the value – a non-null value if the coroutine is not done, and otherwise a null value.

From now on, I will call the object returned by a coroutine – in this case an instance of Gen – the return object. Some articles and blogs call it a future. However, I think the term future is misleading and clashes with the C++ type std::future typically used in multithreaded programming.

The promise type

Finally we will look at what the promise_type is all about. I will give the implementation right here and I will explain it afterwards:

Before continuing, you might want to take a quick look at this diagram: What is inside the runtime coroutine object?

template<typename CRET>
class val_promise_type{
public:
  // we (not the C++20 coroutine framework) require 'return_object' to have a 'value_t' typedef
  using value_t=typename CRET::value_t;

  CRET get_return_object(){return CRET(coroutine_handle<val_promise_type>::from_promise(*this));}
  void return_void(){}
  std::suspend_always yield_value(value_t val){val_=val;return {};}
  std::suspend_always initial_suspend(){return {};}
  std::suspend_always final_suspend()noexcept{return {};} // final_suspend must be noexcept
  value_t value(){return val_;}
private:
  value_t val_;
};

The promise_type is a class used by and required by the C++20 coroutine framework. This is what the promise_type functions mean:

  • get_return_object: called by the framework when the return object from the coroutine is created and returned to the caller
  • return_void: called when we do a co_return from the coroutine without specifying a return value. In our example, we are not doing this just yet!
  • yield_value: called by the framework when co_yield is executed in the coroutine function
  • initial_suspend: when the coroutine is first called, this method controls how execution of the coroutine body should proceed. In our promise_type we return a suspend_always object, which means that the coroutine body will not execute on the first call
  • final_suspend: similar to initial_suspend except that this method is called right before the coroutine completes execution

There is a lot more to say about which methods are called when and under what circumstances, what the implications are, and how the compiler finds the promise_type. For now I won’t get into this since it would muddle the generator example.

The complete code

We now have all the parts: the promise_type, the return object and the coroutine function. Let’s put all of it inside a single code snippet:

#include <iostream>
#include <vector>
#include <string>
#include <optional>
#include <queue>
#include <algorithm>
#include <coroutine>
using namespace std;

// --- (begin) promise object
// promise used when we need to store a single value
template<typename CRET>
class val_promise_type{
public:
  // we require 'return_object' to have a 'value_t' typedef
  using value_t=typename CRET::value_t;

  // these functions are required by the C++20 coroutine machinery
  CRET get_return_object(){return CRET(coroutine_handle<val_promise_type>::from_promise(*this));}
  void return_void(){}
  std::suspend_always yield_value(value_t val){val_=val;return {};}
  std::suspend_always initial_suspend(){return {};}
  std::suspend_always final_suspend()noexcept{return {};}
  
  // this is our own function (not required by the C++20 coroutine machinery)
  value_t value(){return val_;}
private:
  value_t val_;
};
// --- (end) promise object

// -- (begin) coroutine return object for generator
template<typename T>
struct Gen{
  // some typedefs
  using value_t=T;
  using promise_type=val_promise_type<Gen>;

  // keep handle so we can get to promise
  coroutine_handle<promise_type>handle_;

  // ctor
  // (called from 'promise_type::get_return_object()')
  Gen(coroutine_handle<promise_type>handle):handle_(handle){}
  
  // dtor
  ~Gen(){handle_.destroy();}

  // get current value
  value_t value()const{
    return handle_.promise().value();
  }
  // get next item
  optional<value_t>next(){
    if(handle_.done())return std::nullopt;
    handle_.resume();
    return handle_.done()?optional<value_t>{}:value();
  }

};
// -- (end) coroutine return object

// --- (begin) generators

// simple integer sequence
Gen<int>genint(int n){
  for(int i=0;i<n;++i)co_yield i;
}
// we’ll add a few more generators here later
…

// --- (end) generators

// run generator and print result
void genprint(auto&gen){
  while(auto val=gen.next()){
    cout<<val.value()<<endl;
  }
}
// test program
int main(){
  cout<<"----- int sequence"<<endl;
  auto gen=genint(5);
  genprint(gen);
}

Using gcc (GCC) 10.2.0 you can compile and link the code with the following command:

g++  -std=c++20 -fcoroutines  -fno-exceptions -o test1 test1.cc

What just happened?

Before continuing, you might want to take a quick look at this diagram: What is inside the runtime coroutine object?

Here are a few points that might help in understanding how the code works:

  • when the coroutine genint is called, the framework creates (among other things) a promise_type object and then calls get_return_object on that promise object
  • inside get_return_object we construct a Gen<int> object which is then returned to the caller. Notice that we stuff a coroutine_handle object into the returned object
  • later we call next on the returned object which in turn calls resume on the coroutine_handle
  • the resume call tells the framework to resume execution of the coroutine where it last left off. In our case, if this is the first time it will enter the for-loop in the coroutine function
  • the coroutine function executes until it gets to the co_yield statement. This statement triggers the framework to execute the yield_value function in the promise object passing the value we gave to co_yield as parameter
  • in the yield_value function we stuff the value in the val_ attribute so it can later be retrieved
  • now we are back in the next function in the Gen object. There we check if the coroutine is done. If the coroutine is not done we retrieve the value we stuffed in the promise and return it to the caller
  • the caller (in our case genprint) prints the value to stdout

There are a number of rules for how the framework behaves depending on what we do and what we return from methods in the return object and methods in the promise object. For now, we won’t delve into those rules until we have a better grasp of how the whole machinery works.

More generators

The bad thing is that we had to write a lot of filler code to get a simple generator to work. The good thing is that we can now write more generators without writing more filler code.

BFS graph traversal

We can now use the machinery we wrote to traverse a graph in BFS order.

The graph is defined using an adjacency list representation – that is, if vertex 5 has edges pointing to vertex 7, 8 and 10 then g[5]={7, 8, 10}:

// simple BFS traversal of graph
using graph_t=vector<vector<int>>;
enum color_t{white=0,grey=1,black=2};

Gen<int>genbfs(graph_t const&g,int ustart){
  if(g.size()==0)co_return;
  vector<color_t>color(g.size(),white);
  queue<int>q;
  q.push(ustart);
  color[ustart]=grey;
  while(!q.empty()){
    int u=q.front();
    q.pop();
    co_yield(u);
    for(int v:g[u]){
      if(color[v]!=white)continue;
      q.push(v);
      color[v]=grey;
    }
    color[u]=black;
  }
}

An example of BFS traversal of a graph is the following code which can be copied and pasted into the main program:

  cout<<"----- bfs graph"<<endl;
  graph_t g{
    {1, 2},
    {4},
    {3},
    {4},
    {5},
    {}
  };
  auto bfsgen=genbfs(g,0);
  genprint(bfsgen);
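For reference, a plain (non-coroutine) BFS that collects the visit order into a vector looks like this; it should produce the same order that genbfs yields:

```cpp
#include <cassert>
#include <queue>
#include <vector>

using graph_t=std::vector<std::vector<int>>;

// Plain BFS collecting the visit order into a vector - a cross-check of
// what genbfs() yields one element at a time.
std::vector<int> bfs_order(graph_t const&g,int ustart){
  std::vector<int>order;
  if(g.empty())return order;
  std::vector<char>seen(g.size(),0);
  std::queue<int>q;
  q.push(ustart);seen[ustart]=1;
  while(!q.empty()){
    int u=q.front();q.pop();
    order.push_back(u);
    for(int v:g[u]){
      if(seen[v])continue;
      q.push(v);seen[v]=1;
    }
  }
  return order;
}
```

For the graph above both versions visit the vertices in the order 0, 1, 2, 4, 3, 5.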
  

Character reader + tokenizer

A more interesting example, where two coroutines cooperate, is one where a coroutine reads data from a source and feeds it to a tokenizer:

// character reader + tokenizer
Gen<char>genreadchar(string const&str){
  for(char c:str)co_yield c;
}
Gen<string>gentok(Gen<char>&creader){
  string currtok;
  while(creader.next()){
    char c=creader.value();
    if(c==' '){
      if(currtok.length()){
        co_yield currtok;
        currtok.clear();
      }
    }else{
      currtok.push_back(c);
    }
  }
  if(currtok.length())co_yield currtok;
}

The code to run this from the main program is then:

  cout<<"----- tokenizer"<<endl;
  string inpstring="Hello again where is the cheese";
  auto creader=genreadchar(inpstring);
  auto tokreader=gentok(creader);
  genprint(tokreader);
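The same tokenization can of course be written as an ordinary function returning all tokens at once; the difference is that the coroutine version produces tokens lazily, one at a time:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Non-coroutine version of the reader+tokenizer pair: split on spaces,
// skipping empty tokens, mirroring what gentok() produces.
std::vector<std::string> tokenize(std::string const&str){
  std::vector<std::string>toks;
  std::string currtok;
  for(char c:str){
    if(c==' '){
      if(!currtok.empty()){toks.push_back(currtok);currtok.clear();}
    }else{
      currtok.push_back(c);
    }
  }
  if(!currtok.empty())toks.push_back(currtok);
  return toks;
}
```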

Permutations

We can write a final generator that creates all permutations of a string. We’ll use Knuth’s algorithm (Knuth, Volume 4A, page 319) that generates permutations in lexicographically increasing order:

// generate permutations in lexicographical order
Gen<string>genperm(string s){
  int n=s.length();

  // sort in ascending order and return the result as the first permutation
  sort(begin(s),end(s));
  co_yield(s);

  if(n==1)co_return; 
  while(true){
    // find largest k such that s[k]<s[k+1]
    int k=n-2;
    for(;k>=0;--k){
      if(s[k]<s[k+1])break;
    }
    if(k<0)co_return;

    // find largest l>k such that s[k]<s[l]
    int l=n-1;
    for(;l>k;--l){
      if(s[k]<s[l])break;
    }
    swap(s[k],s[l]);
    reverse(begin(s)+k+1,end(s));

    // we now have the next permutation
    co_yield s;
  }
}

And the call from the main program is:

  // generate permutations of a string in lexicographical order
  string str="1234";
  auto permgen=genperm(str);
  genprint(permgen);
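As a sanity check, the standard library's std::next_permutation implements essentially the same lexicographic successor step, so we can enumerate the permutations without coroutines and compare:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Enumerate all permutations of s in lexicographically increasing order
// using std::next_permutation - a cross-check of what genperm() yields.
std::vector<std::string> all_perms(std::string s){
  std::vector<std::string>out;
  std::sort(s.begin(),s.end());     // first permutation: ascending order
  do{
    out.push_back(s);
  }while(std::next_permutation(s.begin(),s.end()));
  return out;
}
```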

What is inside the runtime coroutine object?

I will start by drawing a picture that conceptually describes what is going on:

[Diagram: the runtime coroutine object – the bookkeeping/frame/promise triplet]

An important interface to the C++20 coroutine machinery is the awaitable/awaiter interfaces. These interfaces are not part of this diagram. I excluded them since there are already enough moving parts in the diagram. I will describe them in the next blog entry.

An important observation is the triplet [bookkeeping, frame, promise]. The triplet is the bundle that represents a coroutine object. The bookkeeping part contains information about where to start when a coroutine is resumed. The frame contains temporary variables, variables in the coroutine and parameters to the coroutine. The promise is the part we have some control over. For example, the yield_value function is called when co_yield is invoked from the coroutine. In that function we have a hook to store the value so we later can retrieve it.

Using gcc10 in a Docker based environment

I wrote and tested the code within a Docker image to avoid polluting the machine with more versions of C++ compilers and C++ environments. However, if you already have a C++20 compliant compiler installed, you can safely skip this section. Below is a brief description of how my environment was configured.

Be aware that I'm not an expert in configuring Docker. However, my basic setup appears to work OK.

The setup contains two files:

  • Dockerfile
  • Makefile

The Dockerfile lays out the docker image and looks like this:

FROM gcc:latest

# setup init process
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini
ENTRYPOINT ["/tini", "-g", "-p", "SIGTERM", "--"]

# install vim (does not come with gcc:latest)
RUN apt-get update && apt-get install -y vim

# create/configure new user
ARG arg_user
ARG arg_group
ARG arg_uid
ARG arg_gid

RUN echo $arg_gid
RUN echo $arg_group

RUN groupadd -g $arg_gid $arg_group
RUN useradd -u $arg_uid -g $arg_gid -ms /bin/bash $arg_user
WORKDIR /home/$arg_user

COPY vimrc /home/$arg_user/.vimrc
RUN chown $arg_user .vimrc
RUN chgrp $arg_group .vimrc

COPY bashrc /home/$arg_user/.bashrc
RUN chown $arg_user .bashrc
RUN chgrp $arg_group .bashrc

# run container as $user_arg
USER $arg_user

# start bash
CMD /bin/bash

As you probably can infer from the Dockerfile, the user within the Docker image will have the same user name and user id as the user building the image (not the one running the container).

The Makefile is a quick hack that contains simple commands for building the Docker image as well as starting a container. Additionally it contains a few commands for removing the image, removing containers etc.:

IMAGE=hans-cpp20

all: usage

run:
        docker run -v `pwd`/src://home/$${USER}/src --name $(IMAGE) --rm  -it $(IMAGE) 

build:
        docker build --build-arg arg_user=$$(id -un) --build-arg arg_group=$$(id -gn) --build-arg arg_uid=$$(id -u) --build-arg arg_gid=$$(id -g)  -t $(IMAGE):latest .

rmimage:
        docker image ls | grep $(IMAGE) | awk '{print $$3}' | xargs -I{} docker image rm "{}"

lsimage:
        docker image ls | grep $(IMAGE) 

rmcont:
        docker container ls -a | grep $(IMAGE) | awk '{print $$1}' | xargs -I{} docker container rm "{}"

lscont: 
        docker container ls -a | grep $(IMAGE)

save:
        docker save -o $(IMAGE).tar $(IMAGE)

load:
        docker load -i $(IMAGE).tar

prune:
        docker image prune -f
        docker container prune -f

clean: rmcont rmimage
        rm -f junk.txt *.tar

usage:
        @echo "make all                 # print this message"
        @echo "make run                 # run docker image"
        @echo "make build               # build docker image"
        @echo "make rmimage             # remove image"
        @echo "make lsimage             # list image"
        @echo "make rmcont              # remove container(s)"
        @echo "make lscont              # list container(s)"
        @echo "make save                # save image to tar file"
        @echo "make load                # load image from a saved tar file"
        @echo "make prune               # prune stopped containers and dangling images"
        @echo "make clean               # remove container + image"

To build the Docker image run the following command:

make build

To run a container, execute the following command:

make run

To exit the container do:

^D

The setup maps the src directory in the current directory to the directory /home/<user>/src inside the Docker container. So anything you do to this directory, inside or outside the container, is reflected on both sides.

Next: handling exceptions and scheduling coroutines for execution in a thread pool

In the first followup blog entry I will show how exceptions are managed within the framework of C++20 coroutines. The examples in this blog did not deal with exceptions. In fact, on the compile line we explicitly specified that exceptions would not be supported.

In the blog entry following exception handling I will show how to move coroutine objects into a thread pool for scheduled execution. The code will use awaiter objects to control how a coroutine is suspended when the co_await operator is executed.

Friday, November 22, 2019

A transactional extension to para

He said that for a sorcerer, the world of everyday life is not real, or out there, as we believe it is. For a sorcerer, reality, or the world as we all know it, is only a description

-- Carlos Castaneda, Journey to Ixtlan: The Lessons of Don Juan. Washington Square Press 1972, page viii

Introduction

In a previous blog entry, Efficient processing of line based text, I described the program para. In the blog I described how processing time can be significantly reduced by using para to spread the load of processing lines from an input stream across multiple identical processes each processing one line at a time.

When running para on very large files where processing time is counted in days rather than minutes, it would be beneficial to be able to restart from a well-known point should processing crash in the middle, instead of having to re-run from the beginning.

In this blog entry I give a short overview of an extension to para that supports a simple form of transactions, allowing recovery from crashes in a safe and predictable way. More detailed documentation can be found in the github README file.

Transactions in para

Transactions in para are committed every N lines where N is specified as a command line parameter. When a transaction is committed, a transaction log is written to a file on disk. A transaction log is either new or replaces the previous transaction log. In either case, the log is created atomically.
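The usual way to get this kind of atomicity on POSIX systems is to write a temporary file and then rename() it over the target: a rename within one filesystem is atomic, so readers see either the old or the new log, never a half-written one. A sketch (file naming and log contents invented for illustration; para's actual implementation is in C and may differ):

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Write 'contents' to a temporary file, then atomically rename it over
// 'path'. Returns false if any step fails.
bool write_log_atomically(std::string const&path,std::string const&contents){
  std::string tmp=path+".tmp";
  {
    std::ofstream os(tmp,std::ios::binary|std::ios::trunc);
    if(!os)return false;
    os<<contents;
    if(!os.flush())return false;
  } // stream closed here, before the rename
  return std::rename(tmp.c_str(),path.c_str())==0;
}
```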

If a transaction log exists and para is started in recovery mode, para will read the transaction log and adjust its processing based on the information in the log. For example, say the transaction log contains the following information:

  • #lines committed: 10
  • position in output file: 1038

When para recovers it will skip the first 10 lines and seek to position 1038 in the output stream (if the stream corresponds to a file) and then continue normal processing.

Examples

To check whether para will attempt to recover from a stored transaction log, the following command can be executed:

$ para -r

para will output something like:

recovery info --> #lines-committed: 1100, outfile-pos: 133664

Say we have input.txt:


1
2
3
4
5

and an executable exe.bash:

while read line; do
  echo $line 
  sleep 1
done

and we execute (-R is recovery and -C 2 enables transactions every 2 lines):

$ para -v -i input.txt -o output.out -R -C 2 -- 1 ./exe.bash

the output file output.out is then:

1
2
3
4
5

Now, if we execute the same command but hit ^C after seeing the message:

info: committing at 2 lines...

The output is now:

1
2

or possibly:

1
2
3

and the file .para.txnlog contains (shown using od -x .para.txnlog):

0000000 0002 0000 0000 0000 0004 0000 0000 0000
0000020
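The dump suggests the log is simply two little-endian 64-bit integers: the number of committed lines followed by the output-file position. A sketch of a reader based on that assumption (the layout is inferred from the dump above, para's source is authoritative; the memcpy decoding also assumes a little-endian host):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fstream>

// Assumed log layout: two little-endian 64-bit integers.
struct txnlog{std::uint64_t lines_committed;std::uint64_t outfile_pos;};

bool read_txnlog(char const*path,txnlog&out){
  std::ifstream is(path,std::ios::binary);
  char buf[16];
  if(!is.read(buf,sizeof buf))return false;
  std::memcpy(&out.lines_committed,buf,8);   // assumes a little-endian host
  std::memcpy(&out.outfile_pos,buf+8,8);
  return true;
}
```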

We can view the transaction log by executing:

$ para -r

para informs us that 2 lines were committed and that output in recovery mode will start at position 4 in the output file:

recovery info --> #lines-committed: 2, outfile-pos: 4

Now we can restart para:

$ para -v -i input.txt -o output.out -R -C 2 -- 1 ./exe.bash

para now writes:

info: skipping first: 2 lines in recovery mode, outfilepos: 4 ...
info: positioning to offset: 4 in output stream
info: committing at 4 lines ...
debug: #timers in queue: 1 (expected: 1 HEARTBEAT timer)
info: committing at 5 lines ...
debug: closing files ...
debug: waiting for child processes ...
debug: cleaning up memory ...
debug: ... cleanup done

We see that para skips 2 lines in the input file and positions at file position 4 in the output file before starting to process. The output file is now:

1
2
3
4
5

Thursday, July 18, 2019

Efficient processing of line based text

The less we understand something, the more variables we require to explain it.

-- Ackoff, The Art Of Problem Solving (p. 111)

Introduction

This blog entry contains a few notes about the para program I put on github (https://github.com/hansewetz/para). More detailed documentation can be found in the github README file.

Background to the para program

A typical processing task in computational linguistics applies some processing function to each line in a file. The result of processing a line is written to some output - typically stdout. A constraint on the processing is normally that the generated output must be in the same order as the lines read from the input.

The idea behind the para program was (as far as I know) invented by one or more computational linguists. The problem para solves is to parallelize processing of lines in a text file while ensuring that the result of processing line N is written to the output as line N.

A small example

Say you want to calculate the md5 sum of each line in a file. First we'll write a small bash script md5.bash:

#!/bin/bash
while read line; do
  echo $line | md5sum | awk '{print $1}'
done

The following command:

$ echo hello | ./md5.bash

will print:

b1946ac92492d2347c6235b4d2611184

Now say you have a very large file containing many millions of lines of text and want to calculate the md5 sum for each line in the file. If the file monolingual.en.gz is a gzipped file containing the text to process, we can execute the command:

$ time zcat monolingual.en.gz | ./md5.bash > out1.md5sum

On one of the machines I use I get the following timing information:

real    21m22.874s
user    10m18.485s
sys     28m48.447s

The problem when calculating the md5 sum for many millions of lines is that it takes time. Using para we can speed up the calculation by running 5 processing units in parallel:

$ time zcat monolingual.en.gz | para -- 5 ./md5.bash > out2.md5sum

Running the example on the same machine as I previously executed md5 on I now get the time:

real    4m34.788s
user    10m42.282s
sys     29m59.132s

We can see that the processing time is approximately 5 times faster. This is because para was executed with 5 sub-processes, each running in parallel.

How does para work?

para internally starts a number of sub-processes. In our example para started 5 sub-processes where each sub-process executed the command specified on the para command line.

Internally para sets up a pipe, to be precise a full duplex socket pair, between itself and each sub-process. Each line read from the input is fed to one of the sub-processes. Once a line is sent to a sub-process, para does some bookkeeping, tracking the order of lines read and lines sent to sub-processes.

When a sub-process has finished processing a line, the result is read by para through the socket pipe. para, through its bookkeeping machinery, will either write the result to its output or buffer it internally when writing it would violate output ordering. If the result is buffered, para will track when it can write the buffered result to output so that input and output ordering stay aligned.
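The bookkeeping described above can be sketched as a small reorder buffer: results arrive tagged with their input line number and are only emitted once all earlier lines have been emitted. This is an illustrative reconstruction, not para's actual C code:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Results may arrive out of order; emit each one only after all earlier
// line numbers have been emitted.
struct reorder_buffer{
  std::size_t next_to_emit=0;
  std::map<std::size_t,std::string>pending;
  std::vector<std::string>output;

  void deliver(std::size_t lineno,std::string result){
    pending[lineno]=std::move(result);
    // flush the run of consecutive results starting at next_to_emit
    for(auto it=pending.find(next_to_emit);it!=pending.end();
        it=pending.find(next_to_emit)){
      output.push_back(std::move(it->second));
      pending.erase(it);
      ++next_to_emit;
    }
  }
};
```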

Comments on implementation

For portability reasons the code is written in plain C. My preferred choice would have been C++ together with boost, since plain C does not have the same kind of standard support libraries. Additionally, the language is ... let's say ... pretty limited compared to modern C++. However, C++ and boost create a number of installation and portability issues which are not always easy to solve.

Because of the limited standard library support for data structures in plain C, some data structures and algorithms were implemented from scratch. For example, priority queues, queues and buffers were implemented in the para source code tree.

Improvements to be implemented

para writes one line at a time to a sub-process. If processing a line is not time consuming, reading and writing from/to sub-processes can easily become the bottleneck, with para consuming most of the CPU time. A conceptually straightforward optimization would be for para to write multiple lines in one shot to sub-processes.

para buffers have fixed sizes and cannot be dynamically re-sized. This can be annoying since it requires a user to set a maximum line length before running para, or run with the default setting and hope that lines are not longer than the default maximum. In reality this is not such a big problem since para will not crash in an uncontrolled fashion: if a line is longer than the buffer size, para will simply terminate with an error message.

Features to be implemented

para should be able to read input by connecting to a network based server which feeds para with lines of text. para should also be able to connect and send its output to a network based server. Exactly how to control the number of lines read from the network based server must be thought through and specified.

Download and install para

para can be downloaded and installed from https://github.com/hansewetz/para. Building and installing para is a straight forward cmake based process detailed as 5 simple steps on the github page.

Tuesday, October 16, 2018

An API for configurations in a UNIX environment

Ruthless in purpose insidious in method

-- Orphan Black TV Series, season 3 Episode 8

Introduction

The code together with Getting Started documentation is available on github.
For a quick start please see: github.

Some months ago I ran into a problem related to passing configuration information to a set of applications. I wanted to retrieve configuration information in the form of wrkdir = `pwd`/"wrkdir"-${HOSTNAME}/${USER}. Additionally I needed to access the configuration information from bash scripts, C++ programs and Python programs.

A configuration standard together with an API that would support what I needed was nowhere to be found.

In this post I describe a first cut of a small, simple but quite powerful configuration language together with a supporting API. It can be used in bash programs as well as in C++ programs. In later posts I will show an implementation of the corresponding Python API.

A small example

I'll use the following configuration file - test.cfg - to illustrate how the configuration machinery works:

namespace sys{
  du=`df -k . | grep '%' | grep -v Use | awk '{print $5}'`
  user=${USER}                                                
  usermachine=@"%{user}@`hostname -i`"
}
${USERMACHINE}=sys.usermachine    # export USERMACHINE environment variable

The configuration code illustrates:
  • line#1: a namespace for grouping of variables
  • line#2: assign the output from executing a command to a configuration file variable
  • line#3: assign the value of an environment variable to a configuration file variable
  • line#4: interpolate a string containing the value of a configuration file variable as well as the result of a command
  • line#6: export the USERMACHINE environment variable into the environment

We can view the configuration by using the xconfig utility:

xconfig test.cfg

the output from xconfig is:

sys_du="51%"
sys_user="hansewetz"
sys_usermachine="hansewetz@10.0.0.4"

We can also read the configuration file from a C++ program as shown here:

#include "xconfig/XConfig.h"
#include <iostream>
using namespace std;
using namespace xconfig;

int main(){
  XConfig xfg("test.cfg");
  for(auto&&[name,value]:xfg())cout<<name<<": "<<value<<endl;
}

the output after running the program is:

sys.du: 51%
sys.user: hansewetz
sys.usermachine: hansewetz@10.0.0.4

The for loop prints all variables from the configuration file. The separator between the namespace and a variable name is a dot in this output. In the output from xconfig the separator is an underscore so that the variable can be used in a bash context.
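The mapping between the two output forms is just a separator substitution; a small sketch (illustrative only, the real xconfig code may do this differently):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// xconfig prints 'sys.du' for C++ consumers but 'sys_du' for bash;
// the bash form replaces the namespace dot with an underscore.
std::string to_bash_name(std::string name){
  std::replace(name.begin(),name.end(),'.','_');
  return name;
}
```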

xconfig utility program

Not yet done

Requirements on configuration language

There are a few basic requirements that I put on the configuration language and the supporting API. To start with, it must be possible to access environment variables. I also want to be able to set variables in the current environment.

Next, I should be able to retrieve the output from a command and store it in a variable. I must also be able to concatenate strings and/or integers to form longer strings.

To avoid name clashes in the space of variable names it must be possible to create logical groupings of variables. In the previous example the namespace construct was used to group variables in a sys namespace.

For convenience, I want to be able to concatenate environment variables, output from commands, normal variables, as well as strings and integers inside a string. In the example above I used interpolation (activated by prepending the string with an at sign) to evaluate the value assigned to the variable usermachine.

Requirements on the configuration API

Requirements on the API are fairly simple. First it must be possible to get a complete listing of all variables in the configuration file. For convenience it must also be possible to select variables by name, by regular expression or by namespace.

Before retrieving variables the configuration must be readable from a file or a stream. The API should expose a single method to make this as painless as possible.

Finally, for debugging purposes it must be possible to dump various types of information related to the parsing of a file. For example, in a complex configuration file we might want to see how the internal machinery of the parser evaluates certain expressions.

Additional requirements

The configuration language should not be a full-fledged programming language. The syntax should be simple, it must be easy to read and it must be fairly easy to learn. Specifically, I want to avoid an XML-type syntax since large XML files are notoriously difficult for humans to read.

Configuration language

The language follows a C-like style where variable names consist of alphanumeric characters including the underscore character. Variables can be assigned string or integer values:

user = "hansewetz"             # 'user' is a string
maxthreads = 20                # 'maxthreads' is an integer

Environment variables are accessed by using a bash style notation:

user = ${USER}             # 'user' is assigned the value of environment variable 'USER'
user1 = $USER              # works without braces also

Normal variables are accessed by using their name directly or by prefixing them with '%':

user = @"${USER}"                        # 'user' is assigned the value of environment variable 'USER'
userAtMachine = @"%{user}@`hostname -i`"  # access variable 'user' inside a string

Commands can be executed by enclosing the command in back-quotes:

machine = `hostname`   # get name of machine

The plus operator concatenates strings and adds integers:

user = "hans"+"ewetz"         # user = "hansewetz"
minthreads = 10
maxthreads = minthreads + 8   # maxthreads = 18

An expression on the right hand side of the assignment operator can access a variable using dot separated namespaces:

namespace system{
  user = $USER                     # user = "hansewetz"
  machine = `hostname -i`          # machine = "10.0.0.4"
}
userAtMachine = @"%{system.user}@%{system.machine}"   # userAtMachine = "hansewetz@10.0.0.4"

As the previous example showed, strings can be interpolated. During interpolation, environment variables, normal variables, and commands are all expanded. A variation of the previous example is shown here:

namespace system{
  user = @"${USER}"                # user = "hansewetz"
  machine = `hostname -i`          # machine = "10.0.0.4"
}
userAtMachine = @"%{system.user}@%{system.machine}"   # userAtMachine = "hansewetz@10.0.0.4"

There are some restrictions on how variables can be used. A variable cannot be reassigned once it has a value. Nor can a namespace-qualified variable be used on the left-hand side of the assignment operator.

Design notes

The design is relatively straightforward:

  • bison/flex is used for parsing the configuration
  • during parsing opcodes are written as a program into a virtual machine
  • the virtual machine executes the program, populating the virtual machine's memory with the result
  • a user uses an extractor to retrieve data from the memory of the virtual machine

The design is shown in this diagram:

Implementation notes

The virtual machine is implemented as a simple stack machine tailored specifically for this project. The name of the virtual machine is MMVM - Mickey Mouse Virtual Machine. The MMVM currently supports 13 opcodes. Among them are simple operations such as 'push value on stack' or 'store value in memory'. More complex operations such as 'evaluate a command in a shell and store the output on the stack' are also supported.

Now, one might wonder if it is not overkill to implement a VM just for parsing a configuration file. As it turns out, using a VM makes it almost effortless to support and experiment with new features in the configuration language. And since the MMVM is only around 330 lines of C++ code, it can be implemented in a few hours.

The MMVM currently supports strings and integers. The support is not extensive: two integers can be added and two strings can be concatenated.

Evaluation of a configuration file is done in two steps:

  • compile the file into a program (a sequence of opcodes for the MMVM)
  • execute the program (the sequence of opcodes)

Namespaces are used to scope variables. The full name of a variable is a series of dot-separated namespaces followed by the name of the variable. It is not necessary to use the fully namespace-qualified name of a variable. When namespaces are left out, the compiler looks for the variable in the enclosing scope(s). For example:

# test1.cfg
a = 18
namespace ns{
  namespace ns1{
    b = 19
    namespace ns2{
      c = b       # 'b' is found one step up
      d = ns1.b   # same as previous
      e = a       # 'a' is found in outer most scope
    }
  }
}

The output from executing: xconfig test1.cfg:

a="18"
ns_ns1_b="19"
ns_ns1_ns2_c="19"
ns_ns1_ns2_d="19"
ns_ns1_ns2_e="18"

When interpolating strings it must be possible to handle namespaces the same way as they are handled when referencing variables outside string interpolation. In order to reference variables without specifying the full namespace path, we need a table containing information about the location of variables within namespaces. The table is built at compilation time and used at runtime during string interpolation.

Downloading and building the xconfig API

Please see xconfig under github.

Bugs

Not yet done

Improvements

Not yet done

Thursday, August 16, 2018

A short signal-handling-hack

… unbounded knowledge is an adventure at the boundary of uncertainty …

-- Jacob Bronowski (The Ascent of Man)

Introduction

Following is a short, simple hack implementing an RAII-based signal-handling guard that takes the signals as template parameters.

The guard constructor sets a signal handler for a list of signals and the destructor resets the original signal handlers. The code makes use of variadic templates in a way I find pretty cool.

There is not much to this blog entry except for the code showing how to apply a function to each element in a list of template parameters.

Using the code

A small example using standard RAII style shows how the signal guard can be used:

// test program
int main(){
  {
    // install a signal handler for some signals
    sigguard<SIGINT,SIGHUP> guard([](int sig){});  // must be a named object - an unnamed temporary would be destroyed immediately

    // code, code, code ...
    // ...
  }
  // old signal handler re-installed when 'sigguard' destructor runs
  // ...
}

The code

#include <stack>
#include <iostream>
#include <utility>
#include <stdexcept>
#include <csignal>
#include <cstring>

template<int...SIGS>
class sigguard{
public:
  // ctor
  sigguard(sighandler_t fsig):fsig_(fsig){
    setsigs(std::integer_sequence<int,SIGS...>());
  }
  // dtor
  ~sigguard(){
    while(!oldhandlers_.empty()){
      auto const[sig,fsig]=oldhandlers_.top();
      oldhandlers_.pop();
      signal(sig,fsig);
    }
  }
  private:
    // install signal handler for list of signals
    template<int...sigs>
    void setsigs(std::integer_sequence<int,sigs...>){
      using xxx=int[];
      (void)xxx{(setsig(sigs),0)...};
    }
    // set a single signal handler and save the old one so it can be set in the dtor
    void setsig(int sig){
      auto oldhandler=signal(sig,fsig_);
      if(oldhandler==SIG_ERR){
        throw std::runtime_error(std::string("could not set signal handler: ")+strsignal(sig));
      }
      oldhandlers_.push(std::make_pair(sig,oldhandler));
    }
    // old signal handlers
    std::stack<std::pair<int,sighandler_t>>oldhandlers_;

    // signal handler function
    sighandler_t fsig_;
};

Thursday, May 4, 2017

Evaluation of Translation Quality using Neural Networks - a Proposal

Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations

-- Conway’s law

In this blog post I present an idea for how human evaluation of translation quality can be incorporated into a method for determining the quality of systems that automatically translate from one (human) language to another.

Before getting started it is appropriate to issue a spoiler alert! This post does not present any concrete results. The main point of this blog post is to present an idea that I (or someone else) will verify through experiments. In fact, the idea is not presented as an algorithm but as a method which can be implemented in a variety of ways.

The Problem

When evaluating the quality of automatically translated text it is not so easy to algorithmically specify what it means for a translation to be good or bad or maybe just OK. Yes, there are numerous measures that can be concocted that will indicate that a translation is in some sense close to a reference text. But, being close based on some measure such as edit distance or precision does not say much about the quality of the translation.

In this short blog post I outline a simple method that should generate quality scores that are comparable to quality scores created by human evaluation. I say should since I have only done a few experiments with a small amount of data (while at the same time learning about tensorflow and neural networks).

There are a number of measurements that can be used to evaluate if a sentence is close to a reference sentence. Some measurements are:

  • edit distance
  • longest common subsequence

There are of course numerous other measures (or combination of measures) that can be used to indicate if a sentence is in one way or another close to another sentence.

The problem with such technical measures (or features) is that they say little about the quality of a translation. This should not come as a surprise since there is no algorithmic definition of what constitutes a good translation. From my experience working with computational linguists, the only method that can generate a reliable quality measure is to have a human evaluate the quality using his or her language skills.

I'll give an example that illustrates the problem of not being able to reliably evaluate translation quality. When comparing SMT (statistical machine translation) systems with NNT (neural network translation) systems (or any other type of system), a precision-based score called BLEU is often used as a quality measure. Listening in on comments from computational linguists, I have heard things like: the NNT translation is much better but has about the same BLEU score as the SMT translation. Surprisingly, even though it is patently clear that BLEU is not a reliable measure of quality, it is still being used - mostly because it is the best there is.

The Idea

Since we don't have an algorithmic (or formulaic) description of what constitutes translation quality, it's clear that human evaluation must play a part if we want to create a quality measure that makes sense. The question is: how can human evaluation be used in an automated method that calculates a translation quality score? Certainly, we do not want a human to evaluate every single translated sentence.

The solution is conceptually quite simple. Given a reference sentence and a translated sentence we evaluate a number of measurements (or features) over the sentence pair. The sentence pair is also evaluated and scored by a human. For example, the score could be a number pulled from a range - say 1 ... 10. The feature values together with the human evaluated score can now serve as training data for a neural network.

Once the network has been trained, it can be used when evaluating the translation quality of one or more sentences (as long as the translated sentences have corresponding reference sentences). A set of features are first evaluated on the translated sentences and the reference sentences after which they are fed into the neural network. The output from the neural network is now a quality score which should be comparable to a human evaluation of the translation.

Why should this approach work? The reasoning goes along the following lines. Each feature should in one way or another state something about translation quality (or lack thereof). It seems reasonable that some (non-linear) combination of the calculated features should capture what constitutes translation quality. The problem is that we don't know how to combine features so that we can calculate a meaningful quality score. A simple brute force way of creating a non-linear combination of features is to train a neural network so that it learns what translation quality means in terms of features.

Now, this may or may not work. Clearly we must include the right features. What the right features are is probably not a trivial task to decide on. However, it should not hurt to include features which are not essential since such features should be deemed irrelevant during the training of the neural network. If we miss some essential features it should show up as poor test results for the neural network.

What I have done so far

As mentioned earlier, this post is mostly about the idea or method - not about presenting the result of a full-blown experiment. Due to lack of time, I have only implemented a small experimental 3-layer neural network using tensorflow, where the input was defined by the following 8 features:

  • edit distance (word and character)
  • longest common subsequence (word and character)
  • longest common sequence (word and character)
  • precision
  • accuracy

In total I manually evaluated 100 sentence pairs, assigning a quality score to each one (which even by a long shot is probably not enough training data). The quality scores together with the calculated features were used to train the neural network. The network was then used (in an ad-hoc, non-systematic fashion) to evaluate segments generated by a translation system against their corresponding reference sentences. The results showed that the network mapped features relatively closely (by manual inspection) to the scores I had assigned to reference segments.

Unfortunately I have not had the time to gather more training data and do a proper objective systematic evaluation of the results. At this stage this project is still at the stage where I consider it a cool idea that needs further work.

Hopefully I will have time to work on the idea presented here in the following weeks. I also hope to be able to present a systematic objective result from a trained quality evaluation network. Until then, if someone has ideas for why the method described here will or will not work, feel free to let me know.

A few random observations

A potential drawback of the method is that each language requires its own quality evaluation network. Practically this means that training data for each language must be manually prepared which involves a non-negligible amount of work. However, once a network has been trained it should be usable when evaluating any automatic translation system.

A Computational Linguist colleague of mine suggested that it might be possible to use a quality evaluation network as an error function when training a translation system (SMT or NNT). If the method presented in this blog works OK the idea would be worth investigating.

Why did I choose a 3 layer design of the neural network - why not 2? The reason was rather simple: the idea was that the middle layer would create abstractions that would be aligned with human related language qualities - for example fluency and similarity. I don't have enough knowledge of neural network design to really say if the idea is correct or not. Maybe after understanding more about neural networks I will be able to verify it.