Yesterday I wrote a primitive Brainfuck decompiler. At the moment the only useful things it does are eliminating > and <, rewriting [ and ] into labels, and compressing runs of + and - into commands like "add [0] 4". For example:

$ cat bf/hello2.bf 

++++++++++[>+++++++>++++++++++>+++>+<<<<-]
 >++.>+.+++++++..+++.>++.<<+++++++++++++++.
 >.+++.------.--------.>+.>.

$ runhaskell bfdecomp.hs < bf/hello2.bf 
  add [0] 10
  bz  [0] f1
b1:
  add [1] 7
  add [2] 10
  add [3] 3
  add [4] 1
  sub [0] 1
  bnz [0] b1
f1:
  add [1] 2
  put [1]
  add [2] 1
  put [2]
  add [2] 7
  put [2]
  put [2]
  add [2] 3
  put [2]
  add [3] 2
  put [3]
  add [1] 15
  put [1]
  put [2]
  add [2] 3
  put [2]
  sub [2] 6
  put [2]
  sub [2] 8
  put [2]
  add [3] 1
  put [3]
  put [4]
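
For reference, here is a minimal sketch of the kind of intermediate representation and +/- compression I mean. The names (Op, compress and so on) are hypothetical, not the ones in bfdecomp.hs, and loops are left out of the compression pass:

data Op
  = Add Int Int      -- Add off n   ~  "add [off] n" (n may be negative)
  | Set Int Int      -- Set off n   ~  "set [off] n" (used by later rewrites)
  | Put Int          -- Put off     ~  "put [off]"
  | Get Int          -- Get off     ~  "get [off]"
  | Loop Int [Op]    -- Loop off body  ~  "bz [off] fN: ... bnz [off] bN"
  deriving Show

-- Track the static pointer offset and fold each run of +/- into one Add.
compress :: Int -> String -> [Op]
compress _   []       = []
compress off (c:rest) = case c of
  '+' -> addRun 1    rest
  '-' -> addRun (-1) rest
  '>' -> compress (off + 1) rest
  '<' -> compress (off - 1) rest
  '.' -> Put off : compress off rest
  ',' -> Get off : compress off rest
  _   -> compress off rest          -- '[', ']' and comments omitted in this sketch
  where
    addRun n (d:ds) | d == '+' = addRun (n + 1) ds
                    | d == '-' = addRun (n - 1) ds
    addRun n ds = Add off n : compress off ds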

I thought that I could recover some patterns pretty easily: for example, [-] is equivalent to setting the current cell to 0. But why is that so? How do we determine whether the loop terminates? What tools are helpful here?
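
As a rough illustration of the direction, here is a hedged sketch of that rewrite on top of the hypothetical Op type above. A loop whose body only adds a constant n to the loop cell terminates, with that cell ending at 0, for every starting value exactly when n is odd: cells wrap modulo 256, and gcd(n, 256) = 1 iff n is odd. The usual [-] is just the n = -1 case.

-- Rewrite clear-loops like [-] (or [+], [---], ...) into "set [off] 0".
clearLoop :: Op -> Op
clearLoop (Loop off [Add off' n])
  | off == off' && odd n = Set off 0
clearLoop op = op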

First of all, I don’t want to make Brainfuck programs completely static; my goal here is to recover as many facts about the program as possible and to do as much partial evaluation as possible. In general, if a program does not depend on input, it’s even safe to “replay” it and to emulate its behavior with put $val1; put $val2; .... But here is the first hurdle: what if it is an endless loop? First we have to prove that the loop terminates. In general that is undecidable (it is basically the halting problem), but some partial evaluation is still quite possible.
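
To make “replay” concrete, here is a sketch of a bounded evaluator over the hypothetical Op type: it interprets the program with a step budget, bails out on any get, and, if the program halts within the budget, returns the list of printed values, i.e. the whole program collapses to put v1; put v2; .... This sidesteps the halting problem by simply giving up when the fuel runs out. Run on hello2.bf above, it would reduce the program to the puts for "Hello World!" and a newline.

import Data.Word (Word8)
import qualified Data.Map.Strict as M

replay :: Int -> [Op] -> Maybe [Word8]
replay fuel prog = go fuel M.empty prog []
  where
    cell tape off = M.findWithDefault 0 off tape
    go f _ _ _ | f <= 0 = Nothing                -- budget exhausted: give up
    go _ _    []       out = Just (reverse out)
    go f tape (op:ops) out = case op of
      Get _     -> Nothing                       -- depends on input: bail out
      Put off   -> go (f - 1) tape ops (cell tape off : out)
      Add off n -> go (f - 1) (M.insertWith (+) off (fromIntegral n) tape) ops out
      Set off n -> go (f - 1) (M.insert off (fromIntegral n) tape) ops out
      Loop off body
        | cell tape off == 0 -> go (f - 1) tape ops out
        | otherwise          -> go (f - 1) tape (body ++ (op : ops)) out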

Recently I had a look at Facebook Infer, a fast static analyser for Java/C programs that infers pre- and postconditions of functions on its own. It’s a pretty neat idea, and the separation logic behind it looks attractive, so I wanted to apply some of it to a language as primitive as Brainfuck.

Let’s apply Hoare triples here. First of all, we have to infer the effects of primitive operations (let p be the data pointer and [p] the contents of the current cell), like this:

add [p] n
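
For instance, the triples for the primitive commands might look like this (the notation here is mine, with x a symbolic value and all arithmetic modulo 256):

{ [p] = x }   add [p] n   { [p] = x + n }
{ [p] = x }   sub [p] n   { [p] = x - n }
{ [p] = x }   put [p]     { [p] = x, x appended to the output }
{ true }      get [p]     { 0 <= [p] <= 255 }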

Then we need our program to understand arithmetic: for example, it must know that applying add [p] n to [p] m times is the same as add [p] n*m. Then we need to be able to reason about variables: e.g. >++++<- does not move the data pointer. In other words, we must be able to figure out that > shifts p by +1, + and - shift it by 0, and < shifts it by -1, so the net shift of >++++<- is +1 + 0 - 1 + 0 = 0; and the contents of the cells must obey rules like [p+1] := [p+1] + 4 and [p] := [p] - 1.
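
Here is a sketch of how such effect summaries could be represented and composed in Haskell (hypothetical names again, and only for straight-line, input-free fragments): the summary records the net pointer shift and a map of per-cell deltas, composition shifts the second fragment's offsets by the first fragment's pointer movement, and repeating a pointer-neutral fragment m times just scales every delta by m.

import qualified Data.Map.Strict as M

-- Summary of a loop-free, input-free fragment: net pointer shift plus
-- per-cell deltas (offsets relative to the pointer at the start).
data Effect = Effect { ptrDelta :: Int, cellDelta :: M.Map Int Int }
  deriving Show

-- Sequential composition: the second fragment's offsets are shifted
-- by the pointer movement of the first.
compose :: Effect -> Effect -> Effect
compose (Effect d1 c1) (Effect d2 c2) =
  Effect (d1 + d2) (M.unionWith (+) c1 (M.mapKeys (+ d1) c2))

-- Running a fragment m times: only valid when it does not move the
-- pointer, and then every cell delta is scaled by m
-- ("add [p] n applied m times is add [p] n*m").
power :: Int -> Effect -> Maybe Effect
power m (Effect 0 c) = Just (Effect 0 (M.map (* m) c))
power _ _            = Nothing

-- The effect of >++++<- :
-- example == Effect {ptrDelta = 0, cellDelta = fromList [(0,-1),(1,4)]}
example :: Effect
example = foldl compose (Effect 0 M.empty)
  [ Effect 1 M.empty                 -- >
  , Effect 0 (M.singleton 0 4)       -- ++++
  , Effect (-1) M.empty              -- <
  , Effect 0 (M.singleton 0 (-1))    -- -
  ]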

get/put are not like other operations: we have to preserve their effects. Any value introduced by get must be “poisoned”: we only know that it lies in the range 0 to 255 and is determined at run time. When it is multiplied (or otherwise combined) with another value, it must “poison” the result.
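
A minimal sketch of such a “poisoned” value domain (the names here are mine, not anything from bfdecomp.hs):

import Data.Word (Word8)

-- Abstract cell value: either a compile-time constant or something
-- introduced, directly or indirectly, by get.
data AVal = Known Word8 | Poisoned
  deriving (Eq, Show)

-- A get always yields a poisoned value: all we know is 0 <= v <= 255.
aGet :: AVal
aGet = Poisoned

-- Any operation with a poisoned operand poisons the result.
aAdd, aMul :: AVal -> AVal -> AVal
aAdd (Known a) (Known b) = Known (a + b)   -- wraps modulo 256
aAdd _         _         = Poisoned
aMul (Known a) (Known b) = Known (a * b)
aMul _         _         = Poisoned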

When I had a look at a BF self-interpreter, I was a little shocked: I had not thought people would be bold enough to smash runtime pointers so freely with constructs like [<] (“move left until a cell containing zero is found”) and [-[->]]. Now I have to think about how to handle such situations without going crazy, and how to recover facts about non-zero data regions. But it looks like this is where I will run into the concrete wall of Turing completeness.