LLM Inference using 100% Modern Java ☕️🔥
In the rapidly evolving world of (Gen)AI, Java developers now have powerful new LLM inference tools at their disposal: Llama3.java and JLama.
These projects bring the capabilities of large language models (LLMs) to the Java ecosystem, offering an exciting opportunity for developers to integrate advanced language processing into their applications.
Here's an example of Llama3.java providing inference for the DevoxxGenie IDEA plugin.
The JLama Project
JLama (a 100% Java inference engine) is developed by Jake Luciani and supports a whole range of LLMs:
Gemma & Gemma 2 Models
Llama & Llama2 & Llama3 Models
Mistral & Mixtral Models
Qwen2 Models
GPT-2 Models
BERT Models
BPE Tokenizers
WordPiece Tokenizers
Here's his Devoxx Belgium 2024 presentation with more information and demos.
From a features perspective this is the most advanced Java implementation currently available. It even supports LLM sharding at the layer and attention-head level 🤩
Features include:
Paged Attention
Mixture of Experts
Tool Calling
Generate Embeddings
Classifier Support
Huggingface SafeTensors model and tokenizer format
Support for F32, F16, BF16 types
Support for Q8, Q4 model quantization
Fast GEMM operations
Distributed Inference!
JLama requires Java 20 or later and utilises the new Vector API for faster inference.
You can easily run JLama on your computer; on Apple Silicon, make sure you have an ARM64-based SDK.
You can then start JLama with the restapi parameter and the optional auto-download option to launch the inference service.
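If you'd rather embed JLama directly in your own application instead of running it as a service, it also offers a Java API (via the jlama-core and platform-specific jlama-native artifacts). Here's a minimal sketch modelled on the example in JLama's README; the package names, model coordinates and method signatures follow the 0.x releases and may differ in newer versions:

```java
import java.io.File;
import java.util.UUID;

// Package locations follow JLama 0.x and may have moved in newer releases
import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.model.functions.Generator;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.SafeTensorSupport;
import com.github.tjake.jlama.safetensors.prompt.PromptContext;

public class JLamaSample {
    public static void main(String[] args) throws Exception {
        // Download the quantized model from Hugging Face (or reuse the local copy)
        File localModelPath = SafeTensorSupport.maybeDownloadModel(
                "./models", "tjake/Llama-3.2-1B-Instruct-JQ4");

        // Load the model; the DType arguments control working-memory precision
        AbstractModel model = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

        // Wrap the question in the model's chat template when supported
        String question = "What is the best season to plant avocados?";
        PromptContext ctx = model.promptSupport().isPresent()
                ? model.promptSupport().get().builder().addUserMessage(question).build()
                : PromptContext.of(question);

        // Generate up to 256 tokens at temperature 0.0, streaming tokens as they arrive
        Generator.Response response = model.generate(
                UUID.randomUUID(), ctx, 0.0f, 256, (token, time) -> System.out.print(token));
        System.out.println(response.responseText);
    }
}
```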
The Llama3.java Project
Llama3.java is also a 100% Java implementation, developed by Alfonso² Peterssen and inspired by Andrej Karpathy's llama2.c.
Features include:
Single file, no dependencies
GGUF format parser
Llama 3 tokenizer based on minbpe
Llama 3 inference with Grouped-Query Attention
Support Llama 3.1 (ad-hoc RoPE scaling) and 3.2 (tie word embeddings)
Support for Q8_0 and Q4_0 quantizations
Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API (see the sketch after this list)
Simple CLI with --chat and --instruct modes.
GraalVM's Native Image support (EA builds here)
AOT model pre-loading for instant time-to-first-token
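To give you a feel for those Vector API routines, here's a simplified, self-contained dot product (the core of any matrix-vector multiplication). The real Llama3.java kernels additionally decode Q8_0/Q4_0 quantized blocks on the fly; this sketch sticks to plain floats:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorDot {
    // Picks the widest SIMD width the CPU supports (e.g. 128-bit NEON, 256-bit AVX2)
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // fused multiply-add across all lanes at once
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) { // scalar tail for the remaining elements
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Since the Vector API is still incubating, remember to compile and run with --add-modules jdk.incubator.vector.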
Here's the Devoxx Belgium 2024 presentation by Alfonso and Alina.
Llama3.java + (OpenAI) REST API
Llama3.java doesn't have a REST interface, so I decided to contribute that part ❤️
I've added a Spring Boot wrapper around the core Llama3.java library, allowing developers to easily set up and run an OpenAI-compatible REST API for text generation and chat completions. The goal is to use this as the 100% Java inference engine for the IDEA DevoxxGenie plugin, allowing local inference with a complete Java solution.
Code is available on GitHub
For the time being I've copied the Llama3.java source code into my project but ideally this should be integrated as a Maven dependency.
Key Features
OpenAI-compatible API: The project implements an API that mimics OpenAI's chat completions endpoint, making it easy to integrate with existing applications (see the client sketch after this list).
Support for GGUF Models: Llama3.java can work with GGUF (GPT-Generated Unified Format) models, which are optimised for efficiency and performance.
Vector API Utilization: The project leverages Java's incubator Vector API for improved performance on matrix operations.
Cross-Platform Compatibility: While optimized for Apple Silicon (M1/M2/M3), the project can run on various platforms with the appropriate Java SDK.
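Because the wrapper mimics OpenAI's chat completions endpoint, any HTTP client will do. Here's a minimal sketch using the JDK's built-in HttpClient; the port, path and model name below are assumptions, so check your application.properties for the actual server settings:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatCompletionsClient {
    public static void main(String[] args) throws Exception {
        // OpenAI-style chat completions payload; the model name is a placeholder
        String body = """
                {
                  "model": "llama3",
                  "messages": [
                    {"role": "user", "content": "Why is Java a good fit for LLM inference?"}
                  ]
                }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/chat/completions")) // assumed defaults
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON response with the completion
    }
}
```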
Getting Started
To get started with Llama3.java, follow these steps:
Setup: Ensure you have a compatible Java SDK installed. For Apple Silicon users, an ARM64-based SDK is recommended.
Build: Use Maven to build the project with "mvn clean package".
Download a Model: Obtain a GGUF model from the Hugging Face model hub and place it in the 'models' directory.
Configure: Update the application.properties file with your model details and server settings.
Run: Start the Spring Boot application using the provided Java command.
DevoxxGenie
When the Llama3.java Spring Boot application is running, you can use DevoxxGenie for local inference 🤩
Future Directions
The next step is to move the MatMul bottleneck to the GPU using TornadoVM. The roadmap includes:
Externalise Llama3.java as a Maven dependency (if/when available)
Add GPU support using TornadoVM
GraalVM native versions 🍏
LLM sharding capabilities
Optional: Support for BitNets and Ternary Models
Conclusion
Llama3.java and JLama represent a significant step forward in bringing large language model capabilities to the Java ecosystem. By providing easy-to-use, OpenAI-compatible APIs and leveraging Java's latest performance features, these projects open up new possibilities for AI-driven applications in Java.
Whether you're building a chatbot, a content generation tool, or any application that could benefit from advanced language processing, Llama3.java and JLama offer a promising solution.
As these projects continue to evolve and optimise, they are well worth keeping an eye on for Java developers interested in the cutting edge of AI technology.
Exciting times for Java Developers! ☕️🔥❤️
-Stephan