Thursday, July 18, 2019

Efficient processing of line based text

The less we understand something, the more variables we require to explain it.

-- Ackoff, The Art Of Problem Solving (p. 111)

Introduction

This blog entry contains a few notes about the para program I put on github (https://github.com/hansewetz/para). More detailed documentation can be found in the github README file.

Background to the para program

A typical processing task in computational linguistics applies some processing function to each line in a file. The result of processing a line is written to some output - typically stdout. A constraint on the processing is normally that the generated output must be be in the same order as the lines read from the input.

The idea behind the para program was (as far as I know) invented by one or more computational linguists. The problem para solves is to parallelize processing of lines in a text file while ensuring that the result of processing line N is written to the output as line N.

A small example

Say you want to calculate the md5 sum of each line in a file. First we'll write a small bash script md5.bash:

#!/bin/bash
while read line; do
  echo $line | md5sum | awk '{print $1}'
done

The following command:

$ echo hello | ./md5.bash

will print:

b1946ac92492d2347c6235b4d2611184

Now say you have a very large file containing many millions of lines of text and want to calculate the md5 sum for each line in the file. If the file monolingual.en.gz is a gzipped file containing the text to process, we can execute the command:

$ time zcat monolingual.en.gz | ./md5.bash > out1.md5sum

On one of the machines I use I get the following timing information:

real    21m22.874s
user    10m18.485s
sys     28m48.447s

The problem when calculating the md5 sum for many millions of lines is that it takes time. Using para we can speed up the calculation by running 5 processing units in parallel:

$ time zcat monolingual.en.gz | para -- 5 ./md5.bash > out2.md5sum

Running the example on the same machine as I previously executed md5 on I now get the time:

real    4m34.788s
user    10m42.282s
sys     29m59.132s

We can see that the processing time is approximately 5 times faster. This is because para was executed with 5 sub-processes, each running in parallel.

How does para work?

para internally starts a number of sub-processes. In our example para started 5 sub-processes where each sub-process executed the command specified on the para command line.

Internally para sets up a pipe, to be precise a full duplex socket pair, between itself and each sub-process. Each line read from the input is fed into one of the sub-processes. Ones a line is sent to a sub-process para does some book keeping tracking the order of lines read and lines sent to sub-processes.

When a sub-process has finished processing a line the result is read by para through the socket pipe. para through its book keeping machinery will either write the result to its output, or alternatively buffer it internally when writing the output would violate output ordering. If the result is buffered para, will track when it can write the buffered result to output so that the input and output ordering stays aligned.

Comments on implementation

For portability reasons the code is written in plain C. My preferred choice would have been C++ together with boost since plain C does not have the same type of standard underlying support libraries. Additionally the language is ... let's say ... pretty limited compared to modern C++. However, C++ and boost create a number of installation and portability issues which are not always easy to solve.

Because of the limited standard library support for plain C data structure, some data structures together with algorithms were implemented from scratch. For example, priority queues, queues and buffers were implemented in the para source code tree.

Improvements to be implemented

para writes one line at a time to a sub-process. If the sub-process is not time consuming, reading and writing from/to sub-processes can easily become the bottle neck where para consumes most of the CPU time. A conceptually straight forward optimization in para would implement writing of multiple lines in one shot to sub-processes.

para buffers have fixes size and cannot be dynamically re-sized. This can be annoying since it requires a user to set a maximum length for lines before running para or run with the default setting and hope that lines are not longer than the default maximum line length. In reality this is not such a big problem since para will not crash in an uncontrolled fashion. If a line is longer than the buffer size para will simply terminate with an error message.

Features to be implemented

para should be able to read input by connecting to a network based server which feeds para with lines of text. para should also be able to connect and send output to a network based server. Exactly how to control number of lines read from the network based server must be thought through and specified.

Download and install para

para can be downloaded and installed from https://github.com/hansewetz/para. Building and installing para is a straight forward cmake based process detailed as 5 simple steps on the github page.