Friday, November 22, 2019

A transactional extension to para

He said that for a sorcerer, the world of everyday life is not real, or out there, as we believe it is. For a sorcerer, reality, or the world as we all know, is only a description.

-- Carlos Castaneda, Journey to Ixtlan: The Lessons of Don Juan. Washington Square Press, 1972, page viii

Introduction

In a previous blog entry, Efficient processing of line based text, I described the program para. There I showed how processing time can be significantly reduced by using para to spread the work of processing lines from an input stream across multiple identical processes, each processing one line at a time.

When running para on very large files, where processing time is counted in days rather than minutes, it would be beneficial to be able to restart from a well-known point should processing crash in the middle, instead of having to re-run from the beginning.

In this blog entry I give a very short overview of an extension to para that supports a simple form of transactions, allowing recovery from crashes in a safe and predictable way. More detailed documentation can be found in the GitHub README file.

Transactions in para

Transactions in para are committed every N lines, where N is specified as a command line parameter. When a transaction is committed, a transaction log is written to a file on disk. A transaction log is either new or replaces the previous transaction log. In either case, the log is created atomically.
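As an illustration of what "created atomically" can mean in practice, here is a minimal Python sketch of the common write-to-a-temporary-file-then-rename technique. It is not para's actual implementation. The function name commit_log and the log layout (two little-endian unsigned 64-bit integers) are assumptions made for this example only.

```python
import os
import struct
import tempfile

# Assumed layout, not taken from para's source: two little-endian
# unsigned 64-bit integers (lines committed, output file position).
LOG_FORMAT = "<QQ"


def commit_log(path, lines_committed, outfile_pos):
    """Create or replace the transaction log at `path` in one atomic step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the log so that
    # the final rename never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".txnlog-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(struct.pack(LOG_FORMAT, lines_committed, outfile_pos))
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before renaming
        # A reader sees either the old log or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the rename is the only step that touches the real log file, a crash at any earlier point leaves the previous log intact.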

If a transaction log exists and para is started in recovery mode, para will read the transaction log and adjust its processing based on the information in the log. For example, say the transaction log contains the following information:

  • #lines committed: 10
  • position in output file: 1038

When para recovers it will skip the first 10 lines and seek to position 1038 in the output stream (if the stream corresponds to a file) and then continue normal processing.
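The recovery step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not para's code. The function name resume and its parameters are hypothetical, and in-memory streams stand in for the real input and output files.

```python
import io


def resume(infile, outfile, lines_committed, outfile_pos):
    """Position both streams so that processing restarts after the last commit."""
    # Skip the input lines that were already committed.
    for _ in range(lines_committed):
        if not infile.readline():
            break  # the input is shorter than the log claims
    # Reposition the output only if the stream supports seeking
    # (a regular file does, a pipe does not).
    if outfile.seekable():
        outfile.seek(outfile_pos)


# Small demonstration: 2 lines committed, 4 bytes of committed output,
# plus one uncommitted line ("3") left over from the crashed run.
infile = io.BytesIO(b"1\n2\n3\n4\n5\n")
outfile = io.BytesIO(b"1\n2\n3\n")
resume(infile, outfile, lines_committed=2, outfile_pos=4)
for line in infile:
    outfile.write(line)  # overwrites the uncommitted tail, then appends
print(outfile.getvalue())  # b'1\n2\n3\n4\n5\n'
```

Output written after the last commit is simply overwritten, which is why the uncommitted line does not end up duplicated.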

Examples

To check whether para will attempt to recover from a stored transaction log, the following command can be executed:

$ para -r

para will output something like:

recovery info --> #lines-committed: 1100, outfile-pos: 133664

Say we have input.txt:


1
2
3
4
5

and an executable exe.bash:

#!/bin/bash
# Echo each input line back, pausing one second to simulate slow work.
while IFS= read -r line; do
  echo "$line"
  sleep 1
done

and we execute (-R enables recovery and -C 2 commits a transaction every 2 lines):

$ para -v -i input.txt -o output.out -R -C 2 -- 1 ./exe.bash

The output file output.out then contains:

1
2
3
4
5

Now, if we execute the same command but hit ^C after seeing the message:

info: committing at 2 lines...

The output is now:

1
2

or possibly:

1
2
3

and the file .para.txnlog contains (shown using od -x .para.txnlog):

0000000 0002 0000 0000 0000 0004 0000 0000 0000
0000020
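The dump can be decoded by hand. The following Python sketch assumes that the log holds two unsigned 64-bit integers and that od -x was run on a little-endian machine, so each 16-bit word is printed with its bytes swapped. Both points are inferred from the dump, not taken from para's source, and decode_od_words is a hypothetical helper name.

```python
import struct


def decode_od_words(words):
    """Turn the 16-bit words printed by `od -x` back into the two log fields."""
    # Each word was read as a little-endian 16-bit value, so writing it
    # back in little-endian order restores the original byte sequence.
    raw = b"".join(int(word, 16).to_bytes(2, "little") for word in words.split())
    # Assumed layout: lines committed, then position in the output file.
    return struct.unpack("<QQ", raw)


lines_committed, outfile_pos = decode_od_words(
    "0002 0000 0000 0000 0004 0000 0000 0000"
)
print(lines_committed, outfile_pos)  # 2 4
```

The position 4 matches the size of the committed output: the two lines "1" and "2", each followed by a newline, occupy 4 bytes.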

We can view the transaction log by executing:

$ para -r

para reports that 2 lines were committed and that output in recovery mode will start at position 4 in the output file:

recovery info --> #lines-committed: 2, outfile-pos: 4

Now we can restart para:

$ para -v -i input.txt -o output.out -R -C 2 -- 1 ./exe.bash

para now writes:

info: skipping first: 2 lines in recovery mode, outfilepos: 4 ...
info: positioning to offset: 4 in output stream
info: committing at 4 lines ...
debug: #timers in queue: 1 (expected: 1 HEARTBEAT timer)
info: committing at 5 lines ...
debug: closing files ...
debug: waiting for child processes ...
debug: cleaning up memory ...
debug: ... cleanup done

We see that para skips 2 lines in the input file and positions at file position 4 in the output file before starting to process. The output file is now:

1
2
3
4
5