===========
It seems to me that ARMv7 uprobes need a proper icache
flush after the xol write. Please see the discussion in [1] for a
similar issue on ppc.
It seems that flush_dcache_page was sufficient for later
PPC architectures, but it does not look good enough
for ARMv7.
AFAIK ARMv7 does not have "snooping Harvard caches"
and needs something like a __cpuc_coherent_user_range function call
to sync up the icache and dcache after an instruction is written
through the dcache.
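For comparison, here is a minimal user-space sketch of the same
requirement: after writing an instruction through the data side, the
range has to be made coherent for instruction fetch. It uses the
GCC/Clang builtin __builtin___clear_cache. The helper name
copy_insn_and_sync is hypothetical, and this is only an analogue of
what the kernel would need to do, not the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space helper: copy an instruction into a slot and
 * then ask the compiler runtime to make the I-side coherent with the
 * D-side for that range. On architectures with coherent instruction
 * caches the builtin may expand to nothing.
 */
void copy_insn_and_sync(void *slot, const void *insn, size_t len)
{
	assert(slot != NULL && insn != NULL);

	/* The store goes through the data cache. */
	memcpy(slot, insn, len);

	/* Make the freshly written bytes visible to instruction fetch. */
	__builtin___clear_cache((char *)slot, (char *)slot + len);
}
```

A JIT or any other code that writes instructions at run time would
call such a helper before jumping into the slot.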
The patch that I propose follows this cover letter. In it I
introduce a weak arch_uprobe_xol_sync_dcache_icache function that
does the traditional flush_dcache_page call, and I redefine this
function as a __cpuc_coherent_user_range call in the ARMv7 case.
[1] http://linux-kernel.2935.n7.nabble.com/Re-PATCH-6-9-uprobes-flush-cache-after-xol-write-td216886.html
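The weak-default pattern described above can be sketched in user space
as follows. This is not the patch itself: the argument list of
arch_uprobe_xol_sync_dcache_icache is an assumption, and struct page
and mock_flush_dcache_page are hypothetical stand-ins so that the
sketch compiles outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct page {
	int dcache_flushes;	/* counts calls, for illustration only */
};

/* Hypothetical stand-in for flush_dcache_page(). */
static void mock_flush_dcache_page(struct page *page)
{
	assert(page != NULL);
	page->dcache_flushes++;
}

/*
 * Generic default. Because it is weak, an architecture can provide a
 * strong definition with the same name in its own file, and the linker
 * will pick that one instead. The parameters are assumed here.
 */
__attribute__((weak))
void arch_uprobe_xol_sync_dcache_icache(struct page *page,
					unsigned long vaddr, size_t len)
{
	(void)vaddr;
	(void)len;
	mock_flush_dcache_page(page);
}
```

An architecture that needs more than a dcache flush would define a
non-weak function with the same name that syncs the icache as well.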
Longer story
============
I was trying Dave's armv7 uprobes with SystemTap on an Arndale
board. I used a 3.14-based Linaro linux branch that contained
Dave's armv7 uprobes topic code. I believe it should be
pretty much the same as the armv7 uprobes code that went to Russell's
tree.
I was able to run a simple one-function test and it worked
fine for me. But when I tried to probe many functions with a
SystemTap probe like 'probe process("foobar").function("*")', my
target process always crashed.
After quite a bit of chasing the issue, I was able to come
up with a test case that installs several probes against an
'ls' process. The first probe is placed at the 'push {r4, r5, r6,
r7, r8, r9, r10, r11, lr}' instruction, which is the first
instruction in the _getopt_initialize function; then the script adds
a few more probes at _getopt_initialize addresses that are executed
later. In those probes I dump the register set and the top of the
stack. Looking at the script's output, one can easily conclude that
for each probe the 'push {r4, r5, r6, r7, r8, r9, r10, r11, lr}'
instruction is always executed: the stack grows by 36 bytes each
time and a copy of the corresponding registers appears on the stack.
The code path is the following:
handle_swbp -> pre_ssout
pre_ssout -> xol_get_insn_slot
xol_get_insn_slot -> copy_to_page
xol_get_insn_slot -> flush_dcache_page
pre_ssout -> arch_uprobe_pre_xol
The pre_ssout function calls xol_get_insn_slot, which finds an
available slot in the XOL area that is mapped into the user process
and copies the required instruction into that slot. After that it
calls flush_dcache_page, but in the ARM case this function does not
flush the icache. So I think the following happens: the first time,
the first xol slot gets the 'push {r4, r5, r6, r7, r8, r9, r10, r11,
lr}' instruction and it is fetched into the icache. Later, when
other probes are hit, the same first slot of the xol area gets a
different instruction, but because the icache is not flushed the CPU
keeps executing the 'push' instruction.
When I add the following testing patch that flushes the icache
in arch_uprobe_pre_xol
[kamensky@kamensky-w530 git]$ git diff
@@ -117,6 +117,8 @@ int arch_uprobe_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
 {
 	struct uprobe_task *utask = current->utask;
+	__flush_icache_all();
+
 	if (auprobe->prehandler)
 		auprobe->prehandler(auprobe, &utask->autask, regs);