Friday, April 30, 2021

Taking a first stab at C++20 coroutines

Philosophers' Syndrome: mistaking a failure of the imagination for an insight into necessity.

--Daniel C. Dennett, Consciousness Explained

There is no doubt that the C++20 coroutine framework gives the impression of being a rather complex piece of machinery.

If you are just starting out with C++20 coroutines, I would suggest reading this blog together with a number of other blogs targeting C++20 coroutines, as well as the specification and the online cppreference pages.

Here are a few references to articles, papers and blogs that I have found useful:

Before getting started, I apologize that the C++ syntax highlighting no longer seems to be working in this blog. Tracking down the problem is too time consuming right now, so I will try to fix it later.

Writing C++20 coroutines can initially be a daunting task

Before getting started, it might be worth trying to understand why the C++20 coroutine machinery appears to be rather complicated, just to get an idea of what to watch out for:

  • The simplest patterns and examples are not supported out of the box by the framework. You have to write a bunch of filler code to get anything working
  • The framework has many moving parts, and each part internally is constructed from chewing gum. In fact, it’s hard to understand just where the variability in the framework ends
  • The chewing-gum aspect makes the framework incredibly flexible at the cost of making even simple examples non-trivial to implement
  • The names given to the various types in the framework are confusing
  • To understand what can and cannot be done you literally have to understand everything – understanding bits and pieces just won’t cut it when you want to implement something non-trivial

No ... the framework is not really that complicated ... it just takes a while to get used to ... that's all.

A simple generator based on C++20 coroutines

The problem: we want to write a generator that generates n numbers as a sequence: 0, 1, 2, ... n-1.

That's not really spectacularly hard to achieve using standard programming techniques. However, if the generator function maintains more complicated state (where to continue in the function, temporary variables, etc.) it can become quite hairy.

The solution: use C++20 coroutines to implement the generator.

The coroutine function

Below is the code for a simple generator, generating a sequence of integers as: 0, 1, 2, ... n-1:

#include <coroutine> 
#include <iostream>
#include <optional>
using namespace std;

// filler code
…

template<typename T>
struct Gen{
  T value()const;
  optional<T>next();
};

Gen<int>genint(int n){
  for(int i=0;i<n;++i)co_yield i;
}

By filler code I mean customized boilerplate code that we have to write so that our code works with the C++20 coroutine machinery.

We are still missing the definition of Gen in the code. But once we have it, we can print the sequence as:

int main(){ 
  auto gen=genint(5);
  while(auto val=gen.next()){
    cout<<val.value()<<endl;
  }
}

with the output:

0
1
2
3
4

Before writing the Gen class, let us take a look at the statement:

co_yield i;

Conceptually the statement says:

  • Save the current state of the coroutine genint
  • Save the value i somewhere so that the caller can retrieve it by calling value() (or next()) on the gen object
  • Suspend the coroutine and return to the caller

Exactly how we accomplish this we will see in the implementation of Gen. As you will see, this is where we enter the somewhat confusing and flexible machinery of the C++20 coroutines.

The coroutine function return object

The Gen class looks like:

// definition of val_promise_type
// (to be implemented later)
// …

// Gen implementation
template<typename T>
struct Gen{
  // some typedefs
  using value_t=T;
  using promise_type=val_promise_type<Gen>;

  // keep handle so we can get to promise
  coroutine_handle<promise_type>handle_;

  // ctor
  // (called from 'promise_type::get_return_object()')
  Gen(coroutine_handle<promise_type>handle):handle_(handle){}
  
  // dtor
  ~Gen(){handle_.destroy();}

  // get current value
  value_t value()const{
    return handle_.promise().value();
  }
  // get next item
  optional<value_t>next(){
    if(handle_.done())return std::nullopt;
    handle_.resume();
    return handle_.done()?optional<value_t>{}:value();
  }
};

To start with, the Gen class is using a type called promise_type. In fact, the C++20 framework requires (actually no … but for now let’s pretend it does) the Gen class to contain a type called promise_type. We’ll get to that later, but for now, let’s just assume we have that type defined.

Before continuing, notice that the class Gen<int> is the class the application code deals with when it wants data from a sequence. The code will not deal further with the coroutine function (genint) once it has gotten hold of the Gen<int> object.

Let’s take note of a few things in the code:

  • Gen implements a method value() to retrieve the last value that the coroutine co_yielded
  • Gen implements a method next() for stepping to the next value in the sequence

Notice that these are our methods – they are not part of the framework – so we could have named them anything we wanted. In fact, we could even have supplied an iterator interface inside the Gen class. In a more thought-through design, we probably would have implemented an iterator interface inside the generator.

So far we have not touched much related to the C++20 coroutine framework. Now we will do just that:

  • Inside the value() method, we retrieve the yielded value from the promise object. When we implement the promise_type we will see how the value ends up there
  • Inside the next() method, we use something called a coroutine_handle to check if the coroutine is done. Then we use the same handle to resume the coroutine. Finally, we return the value – a non-null value if the coroutine is not done, and otherwise a null value.

From now on, I will call the object returned by a coroutine – in this case an instance of Gen – the return object. Some articles and blogs call it a future. However, I think the term future is misleading and clashes with the C++ type std::future typically used in multi-threaded programming.

The promise type

Finally we will look at what the promise_type is all about. I will give the implementation right here and I will explain it afterwards:

Before continuing, you might want to take a quick look at this diagram: What is inside the runtime coroutine object?

template<typename CRET>
class val_promise_type{
public:
  // we (not the C++20 coroutine framework) require 'return_object' to have a 'value_t' typedef
  using value_t=typename CRET::value_t;

  CRET get_return_object(){return CRET(coroutine_handle<val_promise_type>::from_promise(*this));}
  void return_void(){}
  std::suspend_always yield_value(value_t val){val_=val;return {};}
  std::suspend_always initial_suspend(){return {};}
  std::suspend_always final_suspend()noexcept{return {};}
  value_t value(){return val_;}
private:
  value_t val_;
};

The promise_type is a class used by and required by the C++20 coroutine framework. This is what the promise_type functions mean:

  • get_return_object: called by the framework when the return object from the coroutine is created and returned to the caller
  • return_void: called when we do a co_return from the coroutine without specifying a return value, or when the coroutine runs off the end of its body
  • yield_value: called by the framework when co_yield is invoked in the coroutine function
  • initial_suspend: when the coroutine is first called, this method controls how the execution of the routine should proceed. In our promise_type we return an object suspend_always which means that the coroutine body will not execute on the first call
  • final_suspend: similar to initial_suspend except this method is called right before the coroutine completes execution

There is a lot more to say about what methods are called when and under what circumstance they are called, what the implications are and as well as how the compiler finds the promise_type. For right now I won’t get into this since it would muddle the generator example.
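That said, a rough sketch of the order in which the framework drives these methods may help. This is pseudocode, not compilable C++ – just the conceptual expansion of genint:

```cpp
// pseudocode -- roughly what the compiler generates for genint
// (frame allocation and exception handling omitted)
{
  promise_type promise;
  Gen<int> ret=promise.get_return_object();  // our Gen, built from the handle
  co_await promise.initial_suspend();        // suspend_always: stop before the body
  for(int i=0;i<n;++i)
    co_await promise.yield_value(i);         // what 'co_yield i' expands to
  promise.return_void();                     // falling off the end of the body
  co_await promise.final_suspend();          // suspend_always: stay suspended, done()==true
}
```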

The complete code

We now have all the parts: the promise_type, the return object and the coroutine function. Let’s put all of it inside a single code snippet:

#include <iostream>
#include <vector>
#include <optional>
#include <queue>
#include <coroutine>
using namespace std;

// --- (begin) promise object
// promise used when we need to store a single value
template<typename CRET>
class val_promise_type{
public:
  // we require 'return_object' to have a 'value_t' typedef
  using value_t=typename CRET::value_t;

  // these functions are required by the C++20 coroutine machinery
  CRET get_return_object(){return CRET(coroutine_handle<val_promise_type>::from_promise(*this));}
  void return_void(){}
  std::suspend_always yield_value(value_t val){val_=val;return {};}
  std::suspend_always initial_suspend(){return {};}
  std::suspend_always final_suspend()noexcept{return {};}
  
  // this is our own function (not required by the C++20 coroutine machinery)
  value_t value(){return val_;}
private:
  value_t val_;
};
// --- (end) promise object

// -- (begin) coroutine return object for generator
template<typename T>
struct Gen{
  // some typedefs
  using value_t=T;
  using promise_type=val_promise_type<Gen>;

  // keep handle so we can get to promise
  coroutine_handle<promise_type>handle_;

  // ctor
  // (called from 'promise_type::get_return_object()')
  Gen(coroutine_handle<promise_type>handle):handle_(handle){}
  
  // dtor
  ~Gen(){handle_.destroy();}

  // get current value
  value_t value()const{
    return handle_.promise().value();
  }
  // get next item
  optional<value_t>next(){
    if(handle_.done())return std::nullopt;
    handle_.resume();
    return handle_.done()?optional<value_t>{}:value();
  }

};
// -- (end) coroutine return object

// --- (begin) generators

// simple integer sequence
Gen<int>genint(int n){
  for(int i=0;i<n;++i)co_yield i;
}
// we’ll add a few more generators here later
…

// --- (end) generators

// run generator and print result
void genprint(auto&gen){
  while(auto val=gen.next()){
    cout<<val.value()<<endl;
  }
}
// test program
int main(){
  cout<<"----- int sequence"<<endl;
  auto gen=genint(5);
  genprint(gen);
}

Using gcc (GCC) 10.2.0 you can compile and link the code with the following command:

g++  -std=c++20 -fcoroutines  -fno-exceptions -o test1 test1.cc

What just happened?

Before continuing, you might want to take a quick look at this diagram: What is inside the runtime coroutine object?

Here are a few points that might help in understanding how the code works:

  • when the coroutine genint is called, the framework creates (among other things) a promise_type object, and then creates the return object by calling get_return_object on the promise object
  • in get_return_object we construct a Gen<int> object which is then returned to the caller. Notice that we stuff a coroutine_handle object in the returned object
  • later we call next on the returned object which in turn calls resume on the coroutine_handle
  • the resume call tells the framework to resume execution of the coroutine where it last left off. In our case, if this is the first time it will enter the for-loop in the coroutine function
  • the coroutine function executes until it gets to the co_yield statement. This statement triggers the framework to execute the yield_value function in the promise object passing the value we gave to co_yield as parameter
  • in the yield_value function we stuff the value in the val_ attribute so it can later be retrieved
  • now we are back in the next function in the Gen object. There we check if the coroutine is done. If the coroutine is not done we retrieve the value we stuffed in the promise and return it to the caller
  • the caller (in our case genprint) prints the value to stdout

There are a number of rules for how the framework behaves depending on what we do and what we return from methods in the return object and methods in the promise object. But for now, we won’t delve into those rules until we have a better grasp on how the whole machinery works.

More generators

The bad thing is that we had to write a lot of filler code to get a simple generator to work. The good thing is that we can now write more generators without writing more filler code.

BFS graph traversal

We can now use the generator we wrote to traverse a graph in BFS order.

The graph is defined using an adjacency list representation – that is, if vertex 5 has edges pointing to vertex 7, 8 and 10 then g[5]={7, 8, 10}:

// simple BFS traversal of graph
using graph_t=vector<vector<int>>;
enum color_t{white=0,grey=1,black=2};

Gen<int>genbfs(graph_t const&g,int ustart){
  if(g.size()==0)co_return;
  vector<color_t>color(g.size(),white);
  queue<int>q;
  q.push(ustart);
  color[ustart]=grey;
  while(!q.empty()){
    int u=q.front();
    q.pop();
    co_yield(u);
    for(int v:g[u]){
      if(color[v]!=white)continue;
      q.push(v);
      color[v]=grey;
    }
    color[u]=black;
  }
}

An example of BFS traversal of a graph is the following code which can be copied and pasted into the main program:

  cout<<"----- bfs graph"<<endl;
  graph_t g{
    {1, 2},
    {4},
    {3},
    {4},
    {5},
    {}
  };
  auto bfsgen=genbfs(g,0);
  genprint(bfsgen);
  

Character reader + tokenizer

A more interesting example, where two coroutines cooperate, is one where a coroutine reads data from a source and feeds the data to a tokenizer:

// character reader + tokenizer
Gen<char>genreadchar(string const&str){
  for(char c:str)co_yield c;
}
Gen<string>gentok(Gen<char>&creader){
  string currtok;
  while(creader.next()){
    char c=creader.value();
    if(c==' '){
      if(currtok.length()){
        co_yield currtok;
        currtok.clear();
      }
    }else{
      currtok.push_back(c);
    }
  }
  if(currtok.length())co_yield currtok;
}

The code to run this from the main program is then:

  cout<<"----- tokenizer graph"<<endl;
  string inpstring="Hello again where is the cheese";
  auto creader=genreadchar(inpstring);
  auto tokreader=gentok(creader);
  genprint(tokreader);

Permutations

We can write a final generator that creates all permutations of a string. We’ll use the Knuth algorithm (Knuth, Volume 4A, page: 319) that generates permutations in lexicographically increasing order:

// generate permutations in lexicographic order (requires <algorithm> and <string>)
Gen<string>genperm(string s){
  int n=s.length();

  // sort in ascending order and return the result as the first permutation
  sort(begin(s),end(s));
  co_yield(s);

  if(n==1)co_return; 
  while(true){
    // find largest k such that s[k]<s[k+1]
    int k=n-2;
    for(;k>=0;--k){
      if(s[k]<s[k+1])break;
    }
    if(k<0)co_return;

    // find largest l>k such that s[k]<s[l]
    int l=n-1;
    for(;l>k;--l){
      if(s[k]<s[l])break;
    }
    swap(s[k],s[l]);
    reverse(begin(s)+k+1,end(s));

    // we now have the next permutation
    co_yield s;
  }
}

And the call from the main program is:

  // generate permutations of a string in lexicographic order
  string str="1234";
  auto permgen=genperm(str);
  genprint(permgen);

What is inside the runtime coroutine object?

I will start by drawing a picture that conceptually describes what is going on:

[diagram: the runtime coroutine object as a triplet of bookkeeping, frame and promise]

An important interface to the C++20 coroutine machinery is the awaitable/awaiter interfaces. These interfaces are not part of this diagram. I excluded them since there are already enough moving parts in the diagram. I will describe them in the next blog entry.

An important observation is the triplet [bookkeeping, frame, promise]. The triplet is the bundle that represents a coroutine object. The bookkeeping part contains information about where to start when a coroutine is resumed. The frame contains temporary variables, variables in the coroutine and parameters to the coroutine. The promise is the part we have some control over. For example, the yield_value function is called when co_yield is invoked from the coroutine. In that function we have a hook to store the value so we later can retrieve it.
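To make the triplet concrete, the coroutine object for genint can be pictured roughly as the struct below. This is purely a conceptual sketch – the real layout is compiler-specific and not accessible from portable code:

```cpp
// conceptual sketch only -- not the actual layout any compiler uses
struct genint_coroutine_object{
  void(*resume)(void*);               // bookkeeping: how to continue execution
  void(*destroy)(void*);              // bookkeeping: how to tear the object down
  int suspend_point;                  // bookkeeping: which suspend point we stopped at
  int n;                              // frame: copy of the coroutine parameter
  int i;                              // frame: local variable of the for-loop
  val_promise_type<Gen<int>> promise; // promise: the part we control
};
```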

Using gcc10 in a Docker based environment

I wrote and tested the code within a Docker image to avoid polluting the machine with more versions of C++ compilers and C++ environments. However, if you already have a C++20 compliant compiler installed, you can safely skip this section. Below is a brief description of how my environment was configured.

Be aware that I'm not an expert in configuring Docker. However, my basic setup appears to work OK.

The setup contains two files:

  • Dockerfile
  • Makefile

The Dockerfile lays out the docker image and looks like this:

FROM gcc:latest

# setup init process
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini
ENTRYPOINT ["/tini", "-g", "-p", "SIGTERM", "--"]

# install vim (does not come with gcc:latest)
RUN apt-get update && apt-get install -y vim

# create/configure new user
ARG arg_user
ARG arg_group
ARG arg_uid
ARG arg_gid

RUN echo $arg_gid
RUN echo $arg_group

RUN groupadd -g $arg_gid $arg_group
RUN useradd -u $arg_uid -g $arg_gid -ms /bin/bash $arg_user
WORKDIR /home/$arg_user

COPY vimrc /home/$arg_user/.vimrc
RUN chown $arg_user .vimrc
RUN chgrp $arg_group .vimrc

COPY bashrc /home/$arg_user/.bashrc
RUN chown $arg_user .bashrc
RUN chgrp $arg_group .bashrc

# run container as $user_arg
USER $arg_user

# start bash
CMD /bin/bash

As you probably can infer from the Dockerfile, the user within the docker image will have the same user name and user id as the user building the image (not the one running the container).

The Makefile is a quick hack that contains simple commands for building the Docker image as well as starting a container. Additionally it contains a few commands for removing the image, removing containers etc.:

IMAGE=hans-cpp20

all: usage

run:
        docker run -v `pwd`/src:/home/$${USER}/src --name $(IMAGE) --rm  -it $(IMAGE) 

build:
        docker build --build-arg arg_user=$$(id -un) --build-arg arg_group=$$(id -gn) --build-arg arg_uid=$$(id -u) --build-arg arg_gid=$$(id -g)  -t $(IMAGE):latest .

rmimage:
        docker image ls | grep $(IMAGE) | awk '{print $$3}' | xargs -I{} docker image rm "{}"

lsimage:
        docker image ls | grep $(IMAGE) 

rmcont:
        docker container ls -a | grep $(IMAGE) | awk '{print $$1}' | xargs -I{} docker container rm "{}"

lscont: 
        docker container ls -a | grep $(IMAGE)

save:
        docker save -o $(IMAGE).tar $(IMAGE)

load:
        docker load -i $(IMAGE).tar

prune:
        docker image prune -f
        docker container prune -f

clean: rmcont rmimage
        rm -f junk.txt *.tar

usage:
        @echo "make all                 # print this message"
        @echo "make run                 # run docker image"
        @echo "make build               # build docker image"
        @echo "make rmimage             # remove image"
        @echo "make lsimage             # list image"
        @echo "make rmcont              # remove container(s)"
        @echo "make lscont              # list container(s)"
        @echo "make save                # save image to tar file"
        @echo "make load                # load image from a saved tar file"
        @echo "make prune               # prune stopped containers and dangling images"
        @echo "make clean               # remove container + image"

To build the Docker image run the following command:

make build

To run a container, execute the following command:

make run

To exit the container do:

^D

The setup maps the src directory in the current directory to the directory /home/<user>/src inside the Docker container. So anything you do to this directory, inside or outside the container, is reflected both inside and outside Docker.

Next: handling exceptions and scheduling coroutines for execution in a thread pool

In the first followup blog entry I will show how exceptions are managed within the framework of C++20 coroutines. The examples in this blog did not deal with exceptions. In fact, on the compile line we explicitly specified that exceptions would not be supported.

In the blog entry following exception handling I will show how to move coroutine objects into a thread pool for scheduled execution. The code will use awaiter objects to control how a coroutine is suspended when the co_await operator is executed.

Friday, November 22, 2019

A transactional extension to para

He said that for a sorcerer, the world of everyday life is not real, or out there, as we believe it is. For a sorcerer, reality, or the world as we all know, is only a description

-- Carlos Castaneda, Journey to Ixtlan: The Lessons of Don Juan. Washington Square Press, 1972, page viii

Introduction

In a previous blog entry, Efficient processing of line based text, I described the program para. In the blog I described how processing time can be significantly reduced by using para to spread the load of processing lines from an input stream across multiple identical processes each processing one line at a time.

When running para on very large files where processing time is counted in days rather than minutes, it would be beneficial to be able to restart from a well-known point should processing crash in the middle, instead of having to re-run from the beginning.

In this blog entry I give a short overview of an extension to para that supports a simple form of transactions, allowing recovery from crashes in a safe and predictable way. More detailed documentation can be found in the github README file.

Transactions in para

Transactions in para are committed every N lines, where N is specified as a command line parameter. When a transaction is committed, a transaction log is written to a file on disk. A transaction log either is new or replaces the previous transaction log. In either case, the log is created atomically.
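para's commit code is not shown here, but a common way to get the atomic replacement described above is to write a temporary file and rename it over the old log – a rename within one filesystem is atomic, so a reader sees either the old log or the new one, never a half-written file. A sketch (my illustration, not para's actual C code) of committing the two values kept in the log:

```cpp
#include <cstdint>
#include <cstdio>

// sketch only: atomically commit (#lines committed, output file position)
// by writing a temp file and renaming it over the previous log
bool commit_txnlog(const char*path,std::uint64_t lines_committed,std::uint64_t outfile_pos){
  char tmp[512];
  std::snprintf(tmp,sizeof(tmp),"%s.tmp",path);
  std::FILE*f=std::fopen(tmp,"wb");
  if(!f)return false;
  bool ok=std::fwrite(&lines_committed,sizeof(lines_committed),1,f)==1
        &&std::fwrite(&outfile_pos,sizeof(outfile_pos),1,f)==1;
  ok=(std::fclose(f)==0)&&ok;
  return ok&&std::rename(tmp,path)==0;
}
```

Writing the two 64-bit values back to back would produce exactly the kind of 16-byte log seen in the od dump further down.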

If a transaction log exists and para is started in recovery mode, para will read the transaction log and adjust its processing based on the information in the log. For example, say the transaction log contains the following information:

  • #lines committed: 10
  • position in output file: 1038

When para recovers it will skip the first 10 lines and seek to position 1038 in the output stream (if the stream corresponds to a file) and then continue normal processing.

Examples

To check if para will attempt to recover from a stored transaction-log the following command can be executed:

$ para -r

para will output something like:

recovery info --> #lines-committed: 1100, outfile-pos: 133664

Say we have input.txt:


1
2
3
4
5

and an executable exe.bash:

while read line; do
  echo $line 
  sleep 1
done

and we execute (-R is recovery and -C 2 enables transactions every 2 lines):

$ para -v -i input.txt -o output.out -R -C 2 -- 1 ./exe.bash

the content of output.out is then:

1
2
3
4
5

Now, if we execute the same command but hit ^C after seeing the message:

info: committing at 2 lines...

The output is now:

1
2

or possibly:

1
2
3

and the file: .para.txnlog contains (using od -x .para.txnlog):

0000000 0002 0000 0000 0000 0004 0000 0000 0000
0000020

We can view the transaction log by executing:

$ para -r

para informs that 2 lines were committed and output in recovery mode starts at position 4 in the output file:

recovery info --> #lines-committed: 2, outfile-pos: 4

Now we can restart para:

$ para -v -i input.txt -o output.out -R -C 2 -- 1 ./exe.bash

para now writes:

info: skipping first: 2 lines in recovery mode, outfilepos: 4 ...
info: positioning to offset: 4 in output stream
info: committing at 4 lines ...
debug: #timers in queue: 1 (expected: 1 HEARTBEAT timer)
info: committing at 5 lines ...
debug: closing files ...
debug: waiting for child processes ...
debug: cleaning up memory ...
debug: ... cleanup done

We see that para skips 2 lines in the input file and positions at file position 4 in the output file before starting to process. The output file is now:

1
2
3
4
5

Thursday, July 18, 2019

Efficient processing of line based text

The less we understand something, the more variables we require to explain it.

-- Ackoff, The Art Of Problem Solving (p. 111)

Introduction

This blog entry contains a few notes about the para program I put on github (https://github.com/hansewetz/para). More detailed documentation can be found in the github README file.

Background to the para program

A typical processing task in computational linguistics applies some processing function to each line in a file. The result of processing a line is written to some output - typically stdout. A constraint on the processing is normally that the generated output must be in the same order as the lines read from the input.

The idea behind the para program was (as far as I know) invented by one or more computational linguists. The problem para solves is to parallelize processing of lines in a text file while ensuring that the result of processing line N is written to the output as line N.

A small example

Say you want to calculate the md5 sum of each line in a file. First we'll write a small bash script md5.bash:

#!/bin/bash
while read line; do
  echo $line | md5sum | awk '{print $1}'
done

The following command:

$ echo hello | ./md5.bash

will print:

b1946ac92492d2347c6235b4d2611184

Now say you have a very large file containing many millions of lines of text and want to calculate the md5 sum for each line in the file. If the file monolingual.en.gz is a gzipped file containing the text to process, we can execute the command:

$ time zcat monolingual.en.gz | ./md5.bash > out1.md5sum

On one of the machines I use I get the following timing information:

real    21m22.874s
user    10m18.485s
sys     28m48.447s

The problem when calculating the md5 sum for many millions of lines is that it takes time. Using para we can speed up the calculation by running 5 processing units in parallel:

$ time zcat monolingual.en.gz | para -- 5 ./md5.bash > out2.md5sum

Running the example on the same machine as I previously executed md5 on I now get the time:

real    4m34.788s
user    10m42.282s
sys     29m59.132s

We can see that the processing time is approximately 5 times faster. This is because para was executed with 5 sub-processes, each running in parallel.

How does para work?

para internally starts a number of sub-processes. In our example para started 5 sub-processes where each sub-process executed the command specified on the para command line.

Internally para sets up a pipe - to be precise, a full duplex socket pair - between itself and each sub-process. Each line read from the input is fed into one of the sub-processes. Once a line is sent to a sub-process, para does some bookkeeping, tracking the order of lines read and lines sent to sub-processes.

When a sub-process has finished processing a line, the result is read by para through the socket pipe. para, through its bookkeeping machinery, will either write the result to its output, or alternatively buffer it internally when writing it would violate output ordering. If the result is buffered, para will track when it can write the buffered result to output so that the input and output ordering stays aligned.

Comments on implementation

For portability reasons the code is written in plain C. My preferred choice would have been C++ together with boost since plain C does not have the same type of standard underlying support libraries. Additionally the language is ... let's say ... pretty limited compared to modern C++. However, C++ and boost create a number of installation and portability issues which are not always easy to solve.

Because of the limited standard library support for plain C data structures, some data structures together with algorithms were implemented from scratch. For example, priority queues, queues and buffers were implemented in the para source code tree.

Improvements to be implemented

para writes one line at a time to a sub-process. If the sub-process task is not time consuming, reading and writing from/to sub-processes can easily become the bottleneck where para consumes most of the CPU time. A conceptually straightforward optimization in para would be to write multiple lines in one shot to sub-processes.

para buffers have fixed size and cannot be dynamically re-sized. This can be annoying since it requires a user to set a maximum line length before running para, or run with the default setting and hope that lines are not longer than the default maximum. In reality this is not such a big problem since para will not crash in an uncontrolled fashion. If a line is longer than the buffer size, para will simply terminate with an error message.

Features to be implemented

para should be able to read input by connecting to a network based server which feeds para with lines of text. para should also be able to connect and send output to a network based server. Exactly how to control the number of lines read from the network based server must be thought through and specified.

Download and install para

para can be downloaded and installed from https://github.com/hansewetz/para. Building and installing para is a straightforward cmake based process detailed as 5 simple steps on the github page.

Tuesday, October 16, 2018

An API for configurations in a UNIX environment

Ruthless in purpose insidious in method

-- Orphan Black TV Series, season 3 Episode 8

Introduction

The code together with Getting started documentation is available on github. For a quick start please see: github.

Some months ago I ran into a problem related to passing configuration information to a set of applications. I wanted to retrieve configuration information in the form of wrkdir = `pwd`/"wrkdir"-${HOSTNAME}/${USER}. Additionally I needed to access the configuration information from bash scripts, C++ programs and Python programs.

A configuration standard together with an API that would support what I needed was nowhere to be found.

In this post I describe a first cut of a small, simple but quite powerful configuration language together with a supporting API. It can be used in bash programs as well as in C++ programs. In later posts I will show an implementation of the corresponding Python API.

A small example

I'll use the following configuration file - test.cfg - to illustrate how the configuration machinery works:

namespace sys{
  du=`df -k . | grep '%' | grep -v Use | awk '{print $5}'`
  user=${USER}                                                
  usermachine=@"%{user}@`hostname -i`"
}
${USERMACHINE}=sys.usermachine    # export USERMACHINE environment variable

The configuration code illustrates:
  • line#1: namespace for grouping of variables
  • line#2: assign the output from execution of a command to a configuration file variable
  • line#3: assign an environment variable to a configuration file variable
  • line#4: interpolate a string containing the value of a configuration file variable as well as the result of a command
  • line#6: export the environment variable USERMACHINE into the environment

We can view the configuration by using the xconfig utility:

xconfig test.cfg

the output from xconfig is:

sys_du="51%"
sys_user="hansewetz"
sys_usermachine="hansewetz@10.0.0.4"

We can also read the configuration file from a C++ program as shown here:

#include "xconfig/XConfig.h"
#include <iostream>
using namespace std;
using namespace xconfig;

int main(){
  XConfig xfg("test.cfg");
  for(auto&&[name,value]:xfg())cout<<name<<": "<<value<<endl;
}

the output after running the program is:

sys.du: 51%
sys.user: hansewetz
sys.usermachine: hansewetz@10.0.0.4

The for loop prints all variables from the configuration file. The separator between the namespace and a variable name is a dot in this output. In the output from xconfig the separator is an underscore so that the variable can be used in a bash context.

xconfig utility program

Not yet done

Requirements on configuration language

There are a few basic requirements that I put on the configuration language and the supporting API. To start with, it must be possible to access environment variables. I also want to be able to set variables in the current environment.

Next, I should be able to retrieve the output from a command and store it in a variable. I must also be able to concatenate strings and/or integers to form longer strings.

To avoid name clashes in the space of variable names it must be possible to create logical groupings of variables. In the previous example the namespace construct was used to group variables in a sys namespace.

For convenience, I want to be able to concatenate environment variables, output from commands, normal variables as well as strings and integers inside a string. In the example above I used interpolation (activated by prepending the string with an at sign) to evaluate the value assigned to the variable usermachine.

Requirements on the configuration API

Requirements on the API are fairly simple. First it must be possible to get a complete listing of all variables in the configuration file. For convenience it must also be possible to select variables by name, by regular expression or by namespace.

Before retrieving variables the configuration must be readable from a file or a stream. The API should expose a single method to make this as painless as possible.

Finally, for debugging purposes it must be possible to dump various types of information related to the parsing of a file. For example, in a complex configuration file we might want to see how the internal machinery of the parser evaluates certain expressions.

Additional requirements

The configuration language should not be a full-fledged programming language. The syntax should be simple, it must be easy to read and it must be fairly easy to learn. Specifically, I want to avoid an XML-type syntax since large XML files are notoriously difficult for humans to read.

Configuration language

The language follows a C-like style where variable names consist of alphanumeric characters including the underscore character. Variables can be assigned string or integer values:

user = "hansewetz"             # 'user' is a string
maxthread = 20                 # 'maxthread' is an integer

Environment variables are accessed by using a bash style notation:

user = ${USER}             # 'user' is assigned the value of environment variable 'USER'
user1 = $USER              # works without braces also

Normal variables are accessed by using their name directly or, inside an interpolated string, by prefixing them with '%':

user = @"${USER}"                        # 'user' is assigned the value of environment variable 'USER'
userAtMachine= @"%{user}@`hostname -i`"  # access variable 'user' and a command inside a string

Commands can be executed by enclosing the command inside back-quotes:

machine = `hostname`   # get name of machine

The plus operator concatenates strings and adds integers:

user = "hans"+"ewetz"         # user = "hansewetz"
minthreads = 10
maxthreads = minthreads + 8   # maxthreads = 18

An expression on the right hand side of the assignment operator can access a variable using dot separated namespaces:

namespace system{
  user = $USER                     # user = "hansewetz"
  machine = `hostname -i`          # machine = "10.0.0.4"
}
userAtMachine = @"%{system.user}@%{system.machine}"   # userAtMachine = "hansewetz@10.0.0.4"

As the previous example showed, strings can be interpolated. During interpolation environment variables, normal variables as well as commands are evaluated. A variation of the previous example is shown here:

namespace system{
  user = @"${USER}"
  machine = `hostname -i`          # machine = "10.0.0.4"
}
userAtMachine = @"%{system.user}@%{system.machine}"   # userAtMachine = "hansewetz@10.0.0.4"

There are some restrictions on how variables can be used. A variable cannot be assigned to once it has a value. Neither can a namespace qualified variable be used on the left hand side of the assignment operator.

Design notes

The design is relatively straightforward:

  • bison/flex is used for parsing the configuration
  • during parsing opcodes are written as a program into a virtual machine
  • the virtual machine executes the program, populating the virtual machine's memory with the result
  • a user uses an extractor to retrieve data from the memory of the virtual machine

The design is shown in this diagram:

Implementation notes

The virtual machine is implemented as a simple stack machine tailored specifically for this project. The name of the virtual machine is MMVM - Mickey Mouse Virtual Machine. The MMVM currently supports 13 opcodes. Among them are simple operations such as 'push value on stack' or 'store value in memory'. More complex operations such as 'evaluate a command in a shell and store output on stack' are also supported.

Now, one might wonder if it is not overkill to implement a VM just for parsing a configuration file. As it turns out, using a VM makes it almost effortless to support and experiment with new features in the configuration language. Since the MMVM implementation is only around 330 lines of C++ code, it can be written in a few hours.

The MMVM currently supports strings and integers. The support is not extensive: two integers can be added and two strings can be concatenated.

Evaluation of a configuration file is done in two steps:

  • compile the file into a program (generate a program consisting of opcodes for the MMVM)
  • execute the program (a sequence of opcodes)

Namespaces are used to scope variables. The full name of a variable is a series of dot-separated namespaces followed by the name of the variable. It is not necessary to use the fully namespace-qualified name of a variable. When namespaces are left out, the compiler looks for the variable in the enclosing scope(s). For example:

# test1.cfg
a = 18
namespace ns{
  namespace ns1{
    b = 19
    namespace ns2{
      c = b       # 'b' is found one step up
      d = ns1.b   # same as previous
      e = a       # 'a' is found in outer most scope
    }
  }
}

The output from executing: xconfig test1.cfg:

a="18"
ns_ns1_b="19"
ns_ns1_ns2_c="19"
ns_ns1_ns2_d="19"
ns_ns1_ns2_e="18"

When interpolating strings it must be possible to handle namespaces the same way as they are handled when referencing variables outside string interpolation. In order to reference variables without specifying the full namespace path, we need a table that contains information about the location of variables within namespaces. The table is built at compile time and used at runtime during string interpolation.

Downloading and building the xconfig API

Please see xconfig under github.

Bugs

Not yet done

Improvements

Not yet done

Thursday, August 16, 2018

A short signal-handling-hack

… unbounded knowledge is an adventure at the boundary of uncertainty …

-- Jacob Bronowski (The Ascent of Man)

Introduction

Following is a short, simple hack: an RAII-based signal handling guard that takes signals as template parameters.

The guard constructor sets a signal handler for a list of signals and the destructor resets the original signal handlers. The code makes use of variadic templates in a way I find pretty cool.

There is not much to this blog entry except for the code showing how to apply a function to each element in a list of template parameters.

Using the code

A small example using standard RAII style shows how the signal guard can be used:

// test program
int main(){
  {
    // install a signal handler for some signals; note that the guard must be
    // a named variable - a temporary would be destroyed immediately
    sigguard<SIGINT,SIGHUP> guard([](int sig){});

    // code, code, code ...
    // ...
  }
  // old signal handler re-installed when 'sigguard' destructor runs
  // ...
}

The code

#include <stack>
#include <iostream>
#include <utility>
#include <csignal>
#include <cstring>
#include <string>
#include <stdexcept>

template<int...SIGS>
class sigguard{
public:
  // ctor
  sigguard(sighandler_t fsig):fsig_(fsig){
    setsigs(std::integer_sequence<int,SIGS...>());
  }
  // dtor
  ~sigguard(){
    while(!oldhandlers_.empty()){
      auto const[sig,fsig]=oldhandlers_.top();
      oldhandlers_.pop();
      signal(sig,fsig);
    }
  }
  private:
    // install signal handler for list of signals
    template<int...sigs>
    void setsigs(std::integer_sequence<int,sigs...>){
      using xxx=int[];
      (void)xxx{(setsig(sigs),0)...};
    }
    // set a single signal handler and save the old one so it can be set in the dtor
    void setsig(int sig){
      auto oldhandler=signal(sig,fsig_);
      if(oldhandler==SIG_ERR){
        throw std::runtime_error(std::string("could not set signal handler: ")+strsignal(sig));
      }
      oldhandlers_.push(std::make_pair(sig,oldhandler));
    }
    // old signal handlers
    std::stack<std::pair<int,sighandler_t>>oldhandlers_;

    // signal handler function
    sighandler_t fsig_;
};

Thursday, May 4, 2017

Evaluation of Translation Quality using Neural Networks - a Proposal

Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations

-- Conway’s law

In this blog post I present an idea for how human evaluation of translation quality can be incorporated into a method for determining the quality of systems that automatically translate from one (human) language to another.

Before getting started it is appropriate to issue a spoiler alert! This post does not present any concrete results. The main point of this blog post is to present an idea that I (or someone else) will verify through experiments. In fact, the idea is not presented as an algorithm but as a method which can be implemented in a variety of ways.

The Problem

When evaluating the quality of automatically translated text it is not so easy to algorithmically specify what it means for a translation to be good or bad or maybe just OK. Yes, there are numerous measures that can be concocted that will indicate that a translation is in some sense close to a reference text. But, being close based on some measure such as edit distance or precision does not say much about the quality of the translation.

In this short blog post I outline a simple method that should generate quality scores that are comparable to quality scores created by human evaluation. I say should since I have only done a few experiments with a small amount of data (while at the same time learning about tensorflow and neural networks).

There are a number of measurements that can be used to evaluate if a sentence is close to a reference sentence. Some measurements are:

  • edit distance
  • longest common subsequence

There are of course numerous other measures (or combination of measures) that can be used to indicate if a sentence is in one way or another close to another sentence.

The problem with such technical measures (or features) is that they say little about the quality of a translation. This should not come as a surprise since there is no algorithmic definition of what constitutes a good translation. From my experience working with computational linguists the only method that can generate a reliable quality measure is to have a human evaluate the quality using his or hers language skills.

I'll give an example that illustrates the problem with not being able to reliably evaluate the quality of translation. When comparing SMT (statistical machine translation) systems with NNT (neural network translation) systems (or any other type of system) a precision based score called BLEU is often used as a quality measure. Listening in on comments from computational linguists I have heard comments like: the NNT translation is much better but has about the same BLEU score as the SMT translation. Surprisingly, even though it is patently clear that BLEU is not at all a reliable measure of quality, it is still being used - mostly for the reason that it is the best there is.

The Idea

Since we don't have an algorithmic (or formulaic) description of what constitutes translation quality, it's clear that human evaluation must play a part if we want to create a quality measure that makes sense. The question is: how can human evaluation be used in an automated method that calculates a translation quality score? Certainly, we do not want a human to evaluate every single translated sentence.

The solution is conceptually quite simple. Given a reference sentence and a translated sentence we evaluate a number of measurements (or features) over the sentence pair. The sentence pair is also evaluated and scored by a human. For example, the score could be a number pulled from a range - say 1 ... 10. The feature values together with the human evaluated score can now serve as training data for a neural network.

Once the network has been trained, it can be used when evaluating the translation quality of one or more sentences (as long as the translated sentences have corresponding reference sentences). A set of features are first evaluated on the translated sentences and the reference sentences after which they are fed into the neural network. The output from the neural network is now a quality score which should be comparable to a human evaluation of the translation.

Why should this approach work? The reasoning goes along the following lines. Each feature should in one way or another state something about translation quality (or lack thereof). It seems reasonable that some (non-linear) combination of the calculated features should capture what constitutes translation quality. The problem is that we don't know how to combine features so that we can calculate a meaningful quality score. A simple brute force way of creating a non-linear combination of features is to train a neural network so that it learns what translation quality means in terms of features.

Now, this may or may not work. Clearly we must include the right features. What the right features are is probably not a trivial task to decide on. However, it should not hurt to include features which are not essential since such features should be deemed irrelevant during the training of the neural network. If we miss some essential features it should show up as poor test results for the neural network.

What I have done so far

As mentioned earlier, this post is mostly about the idea or method - not about presenting the result of a full blown experiment. Due to lack of time, I have only implemented a small experimental 3 layer neural network using tensorflow where the input was defined by the following 8 features:

  • edit distance (word and character)
  • longest common subsequence (word and character)
  • longest common sequence (word and character)
  • precision
  • accuracy

In total I manually evaluated 100 sentence pairs assigning a quality score to each one (which even by a long shot is probably not enough training data). The quality scores together with calculated features were used to train the neural network. The network was then used (in an ad-hoc non-systematic fashion) to evaluate segments generated by a translation system against their corresponding reference sentences. The result showed that the network mapped features relatively close (using manual inspection) to scores I assigned to reference segments.

Unfortunately I have not had the time to gather more training data and do a proper objective systematic evaluation of the results. At this stage this project is still at the stage where I consider it a cool idea that needs further work.

Hopefully I will have time to work on the idea presented here in the following weeks. I also hope to be able to present a systematic objective result from a trained quality evaluation network. Until then, if someone has ideas for why the method described here will or will not work, feel free to let me know.

A few random observations

A potential drawback of the method is that each language requires its own quality evaluation network. Practically this means that training data for each language must be manually prepared which involves a non-negligible amount of work. However, once a network has been trained it should be usable when evaluating any automatic translation system.

A Computational Linguist colleague of mine suggested that it might be possible to use a quality evaluation network as an error function when training a translation system (SMT or NNT). If the method presented in this blog works OK the idea would be worth investigating.

Why did I choose a 3 layer design of the neural network - why not 2? The reason was rather simple: the idea was that the middle layer would create abstractions that would be aligned with human related language qualities - for example fluency and similarity. I don't have enough knowledge of neural network design to really say if the idea is correct or not. Maybe after understanding more about neural networks I will be able to verify it.

Thursday, October 29, 2015

Modelling I - Building Blocks

To perform an analysis means decomposing a situation or problem into its constituents. However, there are not any “primitives” that naturally exist. The components defined are arbitrary and depend on individual goals and methods of calculation. The basic elements of a fire are different to a fire fighter, an insurance claims adjuster, and an arson investigator.

-- Gary Klein, Sources of Power

This is the first post in a planned series of blog posts about modelling in IT projects. Each post will target a specific topic within the area of modelling. My goal is to, in each blog entry, get to the most basic and fundamental ideas within the topic of the blog post.

There are a few reasons why I decided to write the posts. The main reason is simple: it just really bothers me that many persons involved in IT development do not view modelling as an activity in and by itself. It also really bothers me that many IT professionals I've worked with over time do not have a clear view of basic ideas and concepts related to modelling. Finally, this is after all a blog and a good reason for writing about modelling is to take stock of a few ideas I've picked up over the years.

I won't start this series with why it's a good idea to treat modelling as a special activity in IT projects – I'll come to that part in a later blog entry. Instead I'll start from the bottom and in a rather dry way describe a few of the basic building blocks used in modelling.

Idioms, patterns, principles and other useful stuff

Before addressing the topic of modelling I want to point out that there are many other tools, not directly related to modelling, that are often used when building IT systems - idioms, patterns and principles among them. Of course the list can be made longer. Other tools often useful for thinking include the use of analogies, generalization, abstraction and general system laws.

Since most IT people know quite a lot about idioms, patterns and principles I won't discuss them here. Neither will I discuss General Systems Laws since I can get by without them when describing the basics of modelling. The use of analogies, generalization and abstraction are useful when building models and will pop up in some of the posts in this series.

Entities and value objects

An entity is something that has a state that can change over time. An entity also has an identity that is independent of its state and can be used to distinguish the entity from any other entity. Finally, an entity persists over time. Here I’m not talking about persist in the sense of database persistence, but rather in the sense of continues to exist over time.

Are there things that do not continue to exist over time? In the modelling system I describe here there are. An example of something that does not persist over time is an event. Events don't even persist over infinitesimally short time intervals - but more on that later on.

A value object on the other hand has an immutable state and cannot be distinguished from other value objects having the same state. Therefore, you can transparently interchange two value objects having the same state. I italicized the word state since it is questionable if a value object really can be said to have a state since it cannot be changed. Since its state cannot change it is also possible to say that a value object is defined by its value which then is equal to its immutable state. Value objects are also simply called values. Conceptually a value object has always existed and will always exist. In that sense a value object is timeless.

What I said about entities having an identity was a small lie – not a blatant big lie but nevertheless a lie. The truth is that an entity does not have to have a unique identifier. This is usually not so important but I'll explain it anyway. Take the example of a six pack of beer. All the beer cans are full and can be considered to have the same state. Now I grab one can, open it, take a sip and then put it back in the pack. The can I just grabbed has changed its state. Next a neighbour comes along who wants a fresh beer. He does not really care which can he grabs as long as the state of the can he grabs is non-opened. When deciding which can to grab he'll grab any can that has the state non-opened since non-opened cans are interchangeable - well, at least from the perspective of having a cold beer. So even though the cans are entities, each having a state that can change, there is no need to assign a unique identifier to each can. Now, in an IT system we would most likely assign some sort of technical identifier that uniquely identifies a can – a surrogate key or some address would do. In fact, that's what we almost always would do in an IT system because that is usually what we must do. In a modelling system it's mostly done for convenience – not out of necessity.

Events and messages are not the same thing

An event is an instantaneous happening. That's it – no more, no less. Since an event is an instantaneous happening there is an after the event and a before the event but not a during the event. Also, events, in contrast to entities, do not persist over time. An event is instantaneous and happens at some specific point in time. Once it has happened the event no longer exists.

Since events are happenings you can't receive an event and you can't send an event. What you send and receive might be messages if you have the concept of message in your modelling system. You can however be notified that an event occurred. For example, you can be notified that a message arrived event happened so that you can pick up the message. The act of sending a message is also an event – there is a before the message was sent and an after the message was sent but no during the message being sent. In other words, it is not possible to send an event – a message might be sent but not an event.

Saying that something happened instantaneously is obviously relative to the granularity of your world clock. Something may or may not be seen as occurring instantaneously depending on how you measure time. If your world ticks through time one year at a time, then anything that happens between two clock-ticks happens instantaneously and can therefore be seen as an event. Consider the example: the Visigoths sacked Rome in 410 AD. A historian measuring time in years or decades may view the fall of Rome as an event in history as opposed to something that was ongoing over a period of time.

The obvious question now is: how should I choose the length of my clock ticks? As always in modelling it depends on what is important to put the focus on so that the problem at hand can be solved or understood.

If I describe in broad terms the major wars that have occurred during the history of mankind I may choose a century as my timescale. For example, in the fifth century AD the Visigoths sacked Rome, in the 16th century Pizarro conquered the Incas and in the 20th century two world wars ravaged Europe and other parts of the world. The people that were around when these events took place would most likely have used faster clock ticks and would not have viewed them as events. The instigators of the wars, and maybe also the defenders, would have described the wars in terms of processes – not events.

It is clear that the context, the timescale and the purpose will voice their opinions when you decide if something should be classified as an event or not. That leads me to the concept of processes which I describe in the next section.

Processes

The simplest view of what a process is goes as follows: a process is a sequence of events that are ordered in time. A process is simply a series of happenings.

If a process is a sequence of events and events are instantaneous happenings how is it then possible to have a process do things in parallel? Well, tasks can still be done in parallel. It is just that two events cannot happen simultaneously. Even though task T1 and task T2 are executing simultaneously, the start of T1 and the start of T2 cannot happen at the same time. This is the case within the modelling system I'm describing but it is not necessarily true in real life. An important takeaway lesson here is that a model is never the same thing as the stuff being modelled.

When talking about processes it's usually a good idea to separate process definitions from process instances. A process definition is the underlying mechanism that controls the sequence of events whereas a process instance is the actual sequence of events. What I described as a process in the previous paragraph is clearly a process instance.

A more business like view of processes definitions and process instances goes like this: in a business, processes exist for the purpose of achieving some goals that have been setup by the business. Running processes (process instances), manipulate things that are valuable to the business. The processes (instances) themselves don’t have any direct value to the business. After all, the sequence of actions that are performed when trying to achieve a goal is not important in itself. What is important is that the result is close to the goal. It should come as no surprise that the goal of most businesses has something to do with profit, cost savings, quality and time-to-market.

Let's stop and take a short thinking pause here. In the previous paragraph I pointed out something fundamental even though the idea of business processes seems rather trivial. Whenever a team is in the middle of designing and implementing a business process it is often worth taking a deep breath once in a while and thinking about how the business benefits from the process. If there is no benefit there is most probably no point in implementing the process.

So what are those things that are manipulated by running processes? The things that processes manipulate are usually entities that are of value to a business. The goal of the business describes the desirable state of the entities. The goal is then clearly also the driver for defining how running processes should manipulate the entities – that is, the driver for creating process definitions.

The picture below illustrates the parts that are relevant to processes:

In the diagram the Process only manages objects indirectly through Actors. An Actor can of course be realized as a piece of software, as a person, an organization or some other agent. An Actor does not only modify entities, it can also tweak the Goals when needed. For instance, when a business strategy fails an Actor can take a decision to change the Goals.

The extra indirection introduced by the Actor allows us to separate what has to be done from how it is done. The Process manages what must be done whereas the Actor manages how it is done. When the Actor is implemented as a piece of software the Actor is often called a Service. The idea of Actors implemented as Services and the idea of separating the Process from the Actor is at the core of the rather nebulous idea called Service Oriented Architecture where processes can quickly be defined with the help of existing services.

Business objects and processes

Entities that are manipulated by a business process are often called business objects. Business objects come in many different forms. A few examples of entities and processes from the financial industry are:

  • An Account is often modelled as an entity and is manipulated by many different processes. For example, a process managing the withdrawal from a cash account ensures among other things that it is not a fraudulent withdrawal, the right amount is withdrawn and that the cash is sent to the correct destination. Another process may bill the customer for the withdrawal service.
  • A Customer is also often modelled as an entity. There may be many processes that manage Customer entities. For instance, one process might create new Customers while another might handle the billing of Customers.

In short, you define processes so that you can control things that you believe are important for achieving some goals. Essentially, a process definition describes how you want the world to behave.

Life cycles and business processes

In this section I present possibly the most important topic in this blog entry. It is not important because lifecycles are important. It is in fact the exact opposite: lifecycles are almost always useless in modelling. Unfortunately lifecycles are prevalent in models almost everywhere and that is a big mistake since they almost always do more harm than good.

The lifecycle of an entity is a description of the state changes in an entity over time. It turns out that the lifecycle of a business entity is an illusion and can cause lots of grief. The reason is simple: a lifecycle is a consequence of one or more processes managing the business entity.

Business processes will twist and turn a business entity in such a way that the state of the entity benefits the business. A lifecycle of an entity represents the sequence of state changes in the entity that occurs as a consequence of one or more processes manipulating the entity. A lifecycle should therefore not be viewed as something intrinsic to the entity. In fact, it is often very difficult to deduce the lifecycle of an entity that is managed by multiple processes.

Maybe you object and say that it is the process, not the lifecycle that is the illusion, or that it makes no difference which view you choose. Not so. The relationship between processes and lifecycles is not symmetric. The reason for this goes back to the purpose of a business process. The purpose of a business process is to manage entities in such a way that certain goals of the business are met. It is not the other way around since the goals of the business are not managed by lifecycles of entities.

There is also a practical reason for not treating lifecycles as building blocks in models. When you fuse a lifecycle together with an entity you effectively grab the entity and eliminate the chance that it can easily be used in some other processing. After all, you’ll have to store a state attribute in the entity keeping track of where in the lifecycle the entity is. If by any chance you one day decide that the entity fits with some other processing, you already have incorporated a state attribute into the entity. Now, you can either add another state attribute to your entity in order to support the new processing or you can create a very complicated – here I really mean very complicated - lifecycle incorporating the new processing with the existing one.

From the discussion it is clear that the processes within an organisation will be the subject of scrutiny when the organisation does not meet its objectives. It is usually not the lifecycles of the entities or the entities themselves that are the primary target of investigation. Instead, it is the processes that are the primary target for changes and optimization. A consequence of changing a process is that state-changes in business entities will follow a different pattern. This pattern or the so-called lifecycle of business entities may be altered, but again, only as a consequence of changing the business process.

Here is a concrete example from daily life that illustrates the problem with using lifecycles of entities as the primary focus when modelling. When you want to achieve a goal you probably do as I do: you make a plan. If I want to fly to London tomorrow I make a plan: I will set the alarm clock before going to sleep, I will turn off the alarm clock after waking up, I will drive to the airport in the morning, I will take the plane to London, and so on. I plan my day based on the activities I have to execute so that I achieve my goal of reaching London. I don't plan my day based on the lifecycle of my alarm clock, the lifecycle of my car and the lifecycle of the airplane. State changes in my alarm clock, my car and the airplane are simply consequences of executing the process that brings me to London.

Why did I put so much focus on not modelling lifecycles? The reason is that in multiple past projects I've encountered lifecycles that were modelled as intrinsic properties of entities. In all cases, without exception, the cost of making even minor changes to processes within the business was catastrophically high. In some cases entire systems were rewritten simply because of a modelling error that could easily have been avoided had the designers known not to model lifecycles as intrinsic behaviour of entities.

The fine line between events, entities and processes

Many phenomena can be viewed as entities in one context and as processes in a different context. Consider a telephone call. While a telephone call is in progress the participants can see it as a running process; from their viewpoint the call is a sequence of events. Now imagine that you work in the billing department of a phone company. There you view the phone call as an entity, since it contains state representing information from the time the call was made, together with other business information that may be altered by one or more business processes.

The telephone call could also be described as an event in some other context. A few years after the call was made, a lawyer might use the call in a court case where he presents it as a happening that occurred on June 26, 2015 at 4:02pm. If the duration of the phone call is of no significance, the lawyer will most likely describe the call as an event.

The takeaway from the example is this: there is no intrinsic property of a phone call that makes it a process, an entity or an event. Depending on the context and the purpose, a phone call will be described (modelled) in fundamentally different ways.

A more general observation, which covers classifying something as an entity, process or event, is this: the identity, meaning or purpose of a thing is a relation between an observer and the thing, not an intrinsic property of the thing.

Implementing process execution – an observation

I often hear the comment that the make utility should be used to implement business processes. In general that is not a good idea, for the simple reason that make derives the state of a process from the artefacts generated during the process.

In general, the machinery executing a process should manage the process state explicitly, independently of any artefacts that were created. Many ad-hoc implementations, however, do not represent the state explicitly. Instead they derive the state from the artefacts that have been created during execution.

The problem is that the process state cannot, in general, be derived correctly from artefacts alone. For example, when a process execution crashes, it may leave behind artefacts that are incomplete or corrupted. If those artefacts are used to derive the process state when execution is restarted, there is no guarantee that the derived state is correct.

Roles

A role can be seen as a placeholder that an actor steps into at some appropriate point in time. For example, if you borrow 50 cents from me to buy a Coke, you play the role of a borrower; when I lend you the money, I play the role of a lender. The borrower role is a relation between the loan-creation event and you. The lender role, on the other hand, is a relation between the loan-creation event and me. In general, a role is a relation between an event and one or more participants in the event.

It is clear that a role is one of the tools that allow you to add an extra level of indirection at the modelling level. As you already know, indirection lies at the heart of developing maintainable software. Not surprisingly, it often also lies at the heart of developing maintainable models.

A last point I want to make about roles is that they are useful both at the problem level and at the technical solution level. When we model at the level of the problem, roles are useful for understanding the relationships between concepts. When modelling with classes at the level of the technical design (solution level), roles decouple abstractions from a host of possible specializations. This often allows us to control dependencies between classes in a better way than if we had designed using base classes for all potential participants in some event.

Associations and attributes

It’s a common belief that associations are always binary, tying two things together. This is not the case. Here are a few examples of non-binary associations. A loan can be seen as an association between a lender, a borrower and an amount. A cash account in a bank can be modelled as an association between an owner, an amount and a currency. And finally an example that stretches my view of associations: a sailing boat can be viewed as an association between a hull, a mast and a sail.

Consider the following. Is my height an attribute, or an association between myself and 192cm? Maybe the 192cm is an attribute and my height is the association. Or could it be that my height is the attribute and 192cm is the value of the attribute? Or again, is height the attribute, as opposed to my height? Maybe it is not an attribute at all; after all, I can state that I belong to the category of people having a height of 192cm. Does the height attribute refer to the symbol 192cm, or to the distance of 192cm? Say I compare my height attribute, assuming one exists, with that of someone else whose height is 1.92m. Are the attributes the same? Should I compare the distance that 192cm represents with the distance that 1.92m represents, or should I compare the symbol 192cm with the symbol 1.92m?

Are attributes associations or something else? In the height example the question translates into the following: is the attribute a relationship between myself and a height (a distance or a symbol)? Or is it the height itself that is the attribute? That there is no clear-cut answer is not so important. What is important is to realize that the choice between modelling something as an attribute or as an association is not clear cut either, and that how you model something must be based on whether it helps in understanding or solving the problem at hand.

Summary

Most models created in IT projects contain a handful of commonly used constructs. The ones I've discussed in this post are:

  • Entities
  • Value objects
  • Processes
  • Roles
  • Associations
  • Attributes

I also discussed why lifecycles should not be used in models.

When classifying something as one of the named constructs it is important to take the context and the purpose into account. For example, classifying a telephone call as a process is done for a specific purpose in a specific context.

The take-home lesson is that the thing being classified does not have intrinsic properties specifying how it should be classified.

As I mentioned in the beginning of this post, I aimed at describing some basic ideas and concepts in modelling. I hope you agree with me that I stuck to the basics without much elaboration. In the next blog post I'll discuss the basics of using concepts in IT models.