The JVM can finally run transformer models without calling out to Python or C++. Five open-source projects now support LLM inference in pure Java, from Llama 3 to Gemma 4, using modern JDK features like the Vector API and Panama FFI to hit roughly 90% of native C performance on larger models.
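The Vector API matters here because transformer inference is dominated by one hot loop: dot products between weight rows and activation vectors. Below is a minimal sketch of that kernel in plain Java; the `jdk.incubator.vector` version these projects use is shown only in a comment, since the incubator module needs `--add-modules jdk.incubator.vector` to compile.

```java
// The hot loop of transformer inference: a dot product between a weight row
// and an activation vector. Llama3.java-style projects replace this scalar
// loop with jdk.incubator.vector's FloatVector to process 8-16 floats per step.
public class DotKernel {
    static float dot(float[] w, float[] x) {
        float sum = 0f;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * x[i];
        }
        return sum;
    }

    // With the Vector API (requires --add-modules jdk.incubator.vector),
    // the loop becomes, roughly:
    //   var species = FloatVector.SPECIES_PREFERRED;
    //   for (int i = 0; i < species.loopBound(w.length); i += species.length())
    //       acc = FloatVector.fromArray(species, w, i)
    //                        .fma(FloatVector.fromArray(species, x, i), acc);

    public static void main(String[] args) {
        float[] w = {0.5f, -1.0f, 2.0f};
        float[] x = {2.0f, 3.0f, 1.0f};
        System.out.println(dot(w, x)); // 0.5*2 - 1*3 + 2*1 = 0.0
    }
}
```

The JIT already compiles the scalar loop to decent machine code; the Vector API's contribution is guaranteed SIMD width, which is where the near-native numbers come from.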
It started with Llama3.java, built by Alfonso² Peterssen, a GraalVM engineer at Oracle Labs. Inspired by Andrej Karpathy's llama2.c, it's about 2,000 lines in a single file with zero dependencies. Peterssen followed up with Gemma4.java, adding support for Google's Gemma 4 family and nine quantization formats. Both compile to GraalVM Native Images for instant startup. They're educational tools more than production systems, but they proved the JVM could actually do this.
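The quantization formats these loaders read follow GGUF's block layout; Q8_0, for example, stores weights in blocks of 32 signed bytes plus one per-block scale. A minimal round-trip sketch of that scheme (the class name is illustrative, and GGUF's real Q8_0 stores the scale as fp16, kept as a plain float here for clarity):

```java
// Q8_0-style block quantization: each block of 32 floats is stored as one
// scale plus 32 int8 values. quantize() picks the scale so the largest
// magnitude in the block maps to 127; dequantize() inverts it.
public class Q8Block {
    static final int BLOCK = 32;

    static void quantize(float[] src, byte[] q, float[] scales) {
        for (int b = 0; b < src.length / BLOCK; b++) {
            float max = 0f;
            for (int i = 0; i < BLOCK; i++)
                max = Math.max(max, Math.abs(src[b * BLOCK + i]));
            float scale = max / 127f;
            scales[b] = scale;
            for (int i = 0; i < BLOCK; i++)
                q[b * BLOCK + i] = (byte) Math.round(
                        scale == 0f ? 0f : src[b * BLOCK + i] / scale);
        }
    }

    static void dequantize(byte[] q, float[] scales, float[] dst) {
        for (int i = 0; i < dst.length; i++)
            dst[i] = q[i] * scales[i / BLOCK];
    }
}
```

The payoff is 4x less memory traffic than fp32 at a reconstruction error of at most half a scale step per weight, which is why inference engines quantize the weights but keep activations in float.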
Jlama is where things get serious. Built by Jake Luciani, an Apache Cassandra PMC member and former DataStax Chief Architect, it supports Llama, Mistral, Mixtral, Qwen2, IBM Granite, and others. It handles GGUF and HuggingFace SafeTensors formats, includes paged attention and tool calling, and can shard models across multiple JVMs at the layer and attention-head level. There's an OpenAI-compatible REST API and LangChain4j integration. If you want to embed LLM inference in a production Java app right now, Jlama is the most complete option available.
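An OpenAI-compatible endpoint means any stock HTTP client can drive the server, no SDK required. The sketch below builds a standard `/v1/chat/completions` request using only `java.net.http`; the localhost port and model id are placeholders, not Jlama defaults.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds a chat-completion request for an OpenAI-compatible server such as
// Jlama's. Base URL and model id below are illustrative placeholders.
public class ChatRequest {
    static HttpRequest build(String baseUrl, String model, String userMessage) {
        String body = """
            {"model": "%s",
             "messages": [{"role": "user", "content": "%s"}]}""" 
                .formatted(model, userMessage);
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = build("http://localhost:8080", "llama-3-8b", "Hello");
        System.out.println(req.uri()); // http://localhost:8080/v1/chat/completions
        // Send with: HttpClient.newHttpClient()
        //     .send(req, HttpResponse.BodyHandlers.ofString())
    }
}
```

Because the wire format matches OpenAI's, existing clients and frameworks that speak that protocol can point at the local server by swapping the base URL.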
GPU acceleration comes from two directions. GPULlama3.java uses TornadoVM from the University of Manchester to JIT-compile Java bytecode into CUDA or OpenCL kernels at runtime. You write standard Java, and TornadoVM offloads the heavy lifting to the GPU. Qxotic, from a company called Quixotic AI, takes a modular approach: a unified tensor API with backends for CUDA, AMD HIP, and Apple Metal. The company is aiming squarely at finance and government deployments, where a pure Java stack matters for security and compliance.
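What TornadoVM offloads are ordinary static methods over primitive arrays, with the parallel loop marked by its `@Parallel` annotation and the method registered in a `TaskGraph`. The kernel below is the matrix-vector multiply that dominates decoding, written in that offloadable shape; the TornadoVM wiring appears only in comments and follows its documented API as I understand it, but is not exercised here.

```java
// Matrix-vector multiply, the dominant op in transformer decoding, written
// as a static method over primitive arrays -- the shape TornadoVM can
// offload. With TornadoVM on the classpath, the outer loop index would carry
// its @Parallel annotation and the method would be registered roughly as:
//   new TaskGraph("s0").task("matvec", MatVec::matvec, m, x, y, rows, cols)
public class MatVec {
    static void matvec(float[] m, float[] x, float[] y, int rows, int cols) {
        for (int r = 0; r < rows; r++) {      // @Parallel in TornadoVM
            float sum = 0f;
            for (int c = 0; c < cols; c++)
                sum += m[r * cols + c] * x[c];
            y[r] = sum;
        }
    }

    public static void main(String[] args) {
        float[] m = {1f, 0f, 0f, 1f};  // 2x2 identity, row-major
        float[] x = {3f, 4f};
        float[] y = new float[2];
        matvec(m, x, y, 2, 2);
        System.out.println(y[0] + " " + y[1]); // 3.0 4.0
    }
}
```

Each output row is independent, which is exactly the data parallelism a GPU wants; the same code runs unmodified on the CPU when no accelerator is present.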
The question isn't whether Java can run transformers anymore. It's which of these projects fits your stack.