Introduction
In a previous blog entry, Efficient processing of line based text, I described the program para and showed how processing time can be reduced significantly by using para to spread the work of processing lines from an input stream across multiple identical processes, each handling one line at a time.
When running para on very large files, where processing time is counted in days rather than minutes, it would be beneficial to be able to restart from a well-known point should processing crash in the middle, instead of having to re-run from the beginning.
In this blog entry I give a very short overview of an extension to para that supports a simple form of transactions, allowing recovery from crashes in a safe and predictable way. More detailed documentation can be found in the GitHub README file.
Transactions in para
Transactions in para are committed every N lines, where N is specified as a command line parameter. When a transaction is committed, a transaction log is written to a file on disk. A transaction log is either new or replaces the previous transaction log. In either case, the log is created atomically.
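A common way to create or replace a small file atomically is to write a temporary file in the same directory and then rename it over the target. The following is a minimal Python sketch of that technique, not para's actual implementation. The function name commit_txnlog and the record layout (two unsigned 64-bit little-endian integers) are assumptions made for illustration.

```python
import os
import struct
import tempfile


def commit_txnlog(path, lines_committed, outfile_pos):
    """Atomically create or replace the transaction log at `path`.

    Hypothetical helper: the name and the on-disk layout are assumptions,
    not taken from para's source code.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so that the final
    # rename never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".txnlog.tmp.")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(struct.pack("<QQ", lines_committed, outfile_pos))
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # os.replace overwrites an existing file in a single step, so a
        # reader sees either the old log or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling commit_txnlog(".para.txnlog", 10, 1038) would record that 10 lines are committed and that the output file position is 1038. A crash before the rename leaves the previous log untouched.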
If a transaction log exists and para is started in recovery mode, para will read the transaction log and adjust its processing based on the information in the log. For example, say the transaction log contains the following information:
- #lines committed: 10
- position in output file: 1038
When para recovers, it will skip the first 10 lines of input, seek to position 1038 in the output stream (if the stream corresponds to a file), and then continue normal processing.
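The recovery step can be sketched in a few lines of Python. This is only an illustration of the idea, not para's code: the function resume and its process parameter are hypothetical, and discarding output written after the committed position (the truncate call) is a choice made in this sketch rather than documented para behavior.

```python
def resume(input_path, output_path, lines_committed, outfile_pos,
           process=lambda line: line):
    """Continue processing after a crash, using values from a transaction log.

    Hypothetical helper: `process` stands in for the work done by the
    child processes and simply echoes each line by default.
    """
    with open(input_path, "rb") as src, open(output_path, "r+b") as dst:
        # Skip the input lines that were already committed.
        for _ in range(lines_committed):
            src.readline()
        # Position the output at the committed offset and drop anything
        # written after the last commit (assumption of this sketch).
        dst.seek(outfile_pos)
        dst.truncate()
        # Continue normal processing with the remaining lines.
        for line in src:
            dst.write(process(line))
```

With a log recording 10 committed lines and output position 1038, the call would be resume("input.txt", "output.out", 10, 1038).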
Examples
To check whether para will attempt to recover from a stored transaction log, the following command can be executed:
$ para -r
para will output something like:
recovery info --> #lines-committed: 1100, outfile-pos: 133664
Say we have input.txt:
1
2
3
4
5
and an executable exe.bash:
#!/bin/bash
# Echo each input line, pausing one second per line to simulate slow work.
while IFS= read -r line; do
  echo "$line"
  sleep 1
done
and we execute (-R is recovery and -C 2 enables transactions every 2 lines):
$ para -v -i input.txt -o output.out -R -C 2 -- 1 ./exe.bash
The output file output.out then contains:
1
2
3
4
5
Now, if we execute the same command but hit ^C after seeing the message:
info: committing at 2 lines...
The output is now:
1
2
or possibly:
1
2
3
and the file .para.txnlog contains (using od -x .para.txnlog):
0000000 0002 0000 0000 0000 0004 0000 0000 0000
0000020
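The dump is 16 bytes long (the final offset 0000020 is octal) and is consistent with two 8-byte little-endian integers: the number of committed lines followed by the output file position. That layout is inferred from the dump rather than taken from para's documentation, so the following Python sketch should be read under that assumption. The function name parse_txnlog is hypothetical.

```python
import struct


def parse_txnlog(raw):
    """Decode a 16-byte transaction log.

    Assumed layout: two unsigned 64-bit little-endian integers.
    """
    lines_committed, outfile_pos = struct.unpack("<QQ", raw)
    return lines_committed, outfile_pos


# The bytes behind the od -x dump above. On a little-endian machine
# od -x prints 16-bit words, so the word 0002 is the byte pair 02 00.
raw = bytes.fromhex("0200000000000000" "0400000000000000")
print(parse_txnlog(raw))  # prints (2, 4)
```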
We can view the transaction log by executing:
$ para -r
para reports that 2 lines were committed and that output in recovery mode will start at position 4 in the output file:
recovery info --> #lines-committed: 2, outfile-pos: 4
Now we can restart para:
$ para -v -i input.txt -o output.out -R -C 2 -- 1 ./exe.bash
para now writes:
info: skipping first: 2 lines in recovery mode, outfilepos: 4 ...
info: positioning to offset: 4 in output stream
info: committing at 4 lines ...
debug: #timers in queue: 1 (expected: 1 HEARTBEAT timer)
info: committing at 5 lines ...
debug: closing files ...
debug: waiting for child processes ...
debug: cleaning up memory ...
debug: ... cleanup done
We see that para skips 2 lines in the input file and positions at file position 4 in the output file before starting to process. The output file is now:
1
2
3
4
5