I'm not sure why, but I can definitely tell that JOCL is not a simple one-to-one OpenCL binding; it must do some optimization somehow.
I rewrote the code from LWJGL's OpenCL to JOCL, and here are my timing results:
LWJGL's OpenCL:
320ms consumed by GPU
80ms for GPU computing and retrieving data back
605ms for CPU computing
JOCL's OpenCL:
157ms consumed by GPU
39ms for GPU computing and retrieving data back
611ms for CPU computing
With the same amount of data and exactly the same kernel, the CPU time barely changes, but the GPU time is roughly halved.
That's amazing.