Well, CUDA is an extension of C: to write a CUDA kernel you have to code in C and then compile it to executable form with NVIDIA's CUDA compiler (nvcc). The resulting native code can then be linked with Java using JNI. So technically you can't write kernel code in Java. There is JCuda (http://www.jcuda.de/jcuda/JCuda.html), which gives you CUDA's APIs for general memory and device management, plus some Java methods that are implemented in CUDA and wrapped with JNI (FFT, some linear algebra methods, etc.).
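To give an idea of what the JNI route looks like from the Java side, here is a minimal sketch. The library name "cudavecadd" and the method addNative are made up for illustration; the C/CUDA side (a JNI function that launches the kernel, compiled with nvcc) is not shown. The sketch falls back to a plain Java loop when the native library can't be loaded, so it runs on a machine without CUDA.

```java
/** Sketch of the Java side of a JNI-wrapped CUDA kernel (names are hypothetical). */
final class CudaVectorAdd {

    // Hypothetical library name: would resolve to libcudavecadd.so / cudavecadd.dll,
    // built from nvcc-compiled kernel code plus a small JNI wrapper.
    private static final String LIB = "cudavecadd";

    // True only if the native library was found on java.library.path.
    private static final boolean NATIVE_LOADED = tryLoad();

    private static boolean tryLoad() {
        try {
            System.loadLibrary(LIB);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false; // no native library available: use the CPU fallback
        }
    }

    // Would be implemented in C by a JNI function that copies the arrays
    // to the device, launches the CUDA kernel and copies the result back.
    private static native void addNative(float[] a, float[] b, float[] out);

    /** Element-wise a + b, on the GPU if the native library is present. */
    static float[] add(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        float[] out = new float[a.length];
        if (NATIVE_LOADED) {
            addNative(a, b, out);
        } else {
            for (int i = 0; i < a.length; i++) {
                out[i] = a[i] + b[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[] sum = add(new float[] {1f, 2f}, new float[] {3f, 4f});
        System.out.println(java.util.Arrays.toString(sum));
    }
}
```

The point is only that the kernel itself lives outside Java; the Java class just declares a native method and loads the compiled library.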
On the other hand, OpenCL is just an API. OpenCL kernels are plain source strings passed to the API and compiled at run time, so when using OpenCL from Java you should be able to specify your own kernels. An OpenCL binding for Java can be found at http://www.jocl.org/.
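As a rough illustration of "kernels are plain strings", here is a sketch that keeps an OpenCL C kernel in a Java String next to a plain Java reference of the same computation. The kernel name saxpy and the class name are my own choices. The code deliberately makes no OpenCL calls, so it runs without a binding or a GPU; with JOCL you would hand SOURCE to clCreateProgramWithSource and build it.

```java
import java.util.Arrays;

/** Sketch: an OpenCL kernel kept as a Java string, plus a CPU reference. */
final class SaxpyKernel {

    // OpenCL C source. Each work-item handles one element; the bounds check
    // covers the case where the global size is rounded up past n.
    static final String SOURCE =
        "__kernel void saxpy(const float alpha,\n"
      + "                    __global const float *x,\n"
      + "                    __global float *y,\n"
      + "                    const int n)\n"
      + "{\n"
      + "    int i = get_global_id(0);\n"
      + "    if (i < n) {\n"
      + "        y[i] = alpha * x[i] + y[i];\n"
      + "    }\n"
      + "}\n";

    /** Plain Java version of the kernel, useful for checking GPU results. */
    static float[] referenceSaxpy(float alpha, float[] x, float[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        float[] out = new float[y.length];
        for (int i = 0; i < y.length; i++) {
            out[i] = alpha * x[i] + y[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // The string below is what would be passed to the OpenCL API.
        System.out.println(SOURCE);
        float[] expected = referenceSaxpy(2f, new float[] {1f, 2f}, new float[] {10f, 20f});
        System.out.println(Arrays.toString(expected));
    }
}
```

Since the kernel is just text, it can also be loaded from a resource file or generated at run time, which is not possible with a precompiled CUDA kernel.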
The main disadvantage of OpenCL compared with CUDA (at least for me) is the lack of available libraries (Thrust, CUDPP, etc.). However, CUDA code can be ported to OpenCL fairly easily, and studying how those libraries work (algorithms, strategies, etc.) is actually very nice, as you learn a lot from it.
AFAIK, JavaCL / OpenCL4Java is the only OpenCL binding that is available on all platforms right now (including Mac OS X, FreeBSD, Linux, Windows and Solaris, in Intel 32-bit, 64-bit and PPC variants, thanks to its use of JNA).
It has demos, such as this Particles Demo, that actually run fine from Java Web Start, at least on Mac and Windows (to avoid random crashes on Linux, please see this wiki page).
It also comes with a few utilities (GPGPU random number generation, basic parallel reduction, linear algebra) and a Scala DSL.
I have not worked with Rootbeer, but it seems much easier to use than other solutions.
From the project page:
Rootbeer is more advanced than CUDA or OpenCL Java Language Bindings. With bindings the developer must serialize complex graphs of objects into arrays of primitive types. With Rootbeer this is done automatically. Also with language bindings, the developer must write the GPU kernel in CUDA or OpenCL. With Rootbeer a static analysis of the Java Bytecode is done (using Soot) and CUDA code is automatically generated.
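To make the serialization point in that quote concrete, here is a sketch of the kind of code you end up writing by hand with plain bindings: packing objects into a primitive array before the transfer and unpacking them afterwards. The Particle class and its field layout are hypothetical, and nothing here uses Rootbeer or a GPU API.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: flattening objects into a float[] the way a binding user must. */
final class ParticleFlattening {

    // Hypothetical domain object, not part of any library.
    static final class Particle {
        final float x;
        final float y;
        final float mass;

        Particle(float x, float y, float mass) {
            this.x = x;
            this.y = y;
            this.mass = mass;
        }
    }

    // Number of floats stored per particle: x, y, mass.
    static final int FIELDS = 3;

    /** Packs particles as [x0, y0, m0, x1, y1, m1, ...]. */
    static float[] flatten(List<Particle> particles) {
        float[] data = new float[particles.size() * FIELDS];
        int offset = 0;
        for (Particle p : particles) {
            data[offset] = p.x;
            data[offset + 1] = p.y;
            data[offset + 2] = p.mass;
            offset += FIELDS;
        }
        return data;
    }

    /** Rebuilds particles from an array produced by flatten. */
    static List<Particle> unflatten(float[] data) {
        if (data.length % FIELDS != 0) {
            throw new IllegalArgumentException("array length is not a multiple of " + FIELDS);
        }
        List<Particle> particles = new ArrayList<>(data.length / FIELDS);
        for (int offset = 0; offset < data.length; offset += FIELDS) {
            particles.add(new Particle(data[offset], data[offset + 1], data[offset + 2]));
        }
        return particles;
    }

    public static void main(String[] args) {
        List<Particle> input = new ArrayList<>();
        input.add(new Particle(0f, 1f, 2f));
        input.add(new Particle(3f, 4f, 5f));
        float[] flat = flatten(input); // this array is what would be copied to the device
        System.out.println(flat.length + " floats for " + unflatten(flat).size() + " particles");
    }
}
```

For a flat class like this it is only tedious; for nested object graphs the packing code grows quickly, which is the part Rootbeer claims to automate.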
If you want to do some image processing or geometric operations, you may want a linear algebra library with GPU support (with CUDA, for instance). I would suggest ND4J, which is the linear algebra library with CUDA GPU support on which DeepLearning4J is built. With it you don't have to deal with CUDA directly or write low-level code in C. Plus, if you want to do more with images using DL4J, you will have access to specific image processing operations such as convolution.