LLM Inference using 100% Modern Java ☕️🔥
In the rapidly evolving world of (Gen)AI, Java developers now have powerful new LLM inference tools at their disposal: Llama3.java and JLama.
These projects bring the capabilities of large language models (LLMs) to the Java ecosystem, offering an exciting opportunity for developers to integrate advanced language processing into their applications.
Here's an example of Llama3.java providing inference for the DevoxxGenie IDEA plugin.
The JLama Project
JLama (a 100% Java inference engine) is developed by Jake Luciani and supports a whole range of LLMs:
Gemma & Gemma 2 Models
Llama & Llama2 & Llama3 Models
Mistral & Mixtral Models
Qwen2 Models
GPT-2 Models
BERT Models
BPE Tokenizers
WordPiece Tokenizers
Here's his Devoxx Belgium 2024 presentation with more information and demos.
From a features perspective this is the most advanced Java implementation currently available. It even supports LLM sharding at the layer and attention-head level 🤩
Features include:
Paged Attention
Mixture of Experts
Tool Calling
Generate Embeddings
Classifier Support
Huggingface SafeTensors model and tokenizer format
Support for F32, F16, BF16 types
Support for Q8, Q4 model quantization
Fast GEMM operations
Distributed Inference!
JLama requires Java 20 or later and utilises the new Vector API for faster inference.
You can easily run JLama on your computer; on Apple Silicon, make sure you have an ARM64-based SDK.
You can then start JLama with the restapi parameter and the optional auto-download option to launch the inference service.
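If you'd rather embed JLama directly in your own application instead of running it as a service, it also offers a Java API (via the jlama-core and platform-specific jlama-native artifacts). Here's a minimal sketch modelled on the example in JLama's README; the package names, model coordinates and method signatures follow the 0.x releases and may differ in newer versions:

```java
import java.io.File;
import java.util.UUID;

// Package locations follow JLama 0.x and may have moved in newer releases
import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.model.functions.Generator;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.SafeTensorSupport;
import com.github.tjake.jlama.safetensors.prompt.PromptContext;

public class JLamaSample {
    public static void main(String[] args) throws Exception {
        // Download the quantized model from Hugging Face (or reuse the local copy)
        File localModelPath = SafeTensorSupport.maybeDownloadModel(
                "./models", "tjake/Llama-3.2-1B-Instruct-JQ4");

        // Load the model; the DType arguments control working-memory precision
        AbstractModel model = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

        // Wrap the question in the model's chat template when supported
        String question = "What is the best season to plant avocados?";
        PromptContext ctx = model.promptSupport().isPresent()
                ? model.promptSupport().get().builder().addUserMessage(question).build()
                : PromptContext.of(question);

        // Generate up to 256 tokens at temperature 0.0, streaming tokens as they arrive
        Generator.Response response = model.generate(
                UUID.randomUUID(), ctx, 0.0f, 256, (token, time) -> System.out.print(token));
        System.out.println(response.responseText);
    }
}
```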
The Llama3.java Project
Llama3.java is also a 100% Java implementation, developed by Alfonso² Peterssen and inspired by Andrej Karpathy's llama2.c.
Features include:
Single file, no dependencies
GGUF format parser
Llama 3 tokenizer based on minbpe
Llama 3 inference with Grouped-Query Attention
Support Llama 3.1 (ad-hoc RoPE scaling) and 3.2 (tie word embeddings)
Support for Q8_0 and Q4_0 quantizations
Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API (see the sketch after this list)
Simple CLI with --chat and --instruct modes.
GraalVM's Native Image support (EA builds here)
AOT model pre-loading for instant time-to-first-token
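To give you a feel for those Vector API routines, here's a simplified, self-contained dot product (the core of any matrix-vector multiplication). The real Llama3.java kernels additionally decode Q8_0/Q4_0 quantized blocks on the fly; this sketch sticks to plain floats:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorDot {
    // Picks the widest SIMD width the CPU supports (e.g. 128-bit NEON, 256-bit AVX2)
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // fused multiply-add across all lanes at once
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) { // scalar tail for the remaining elements
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Since the Vector API is still incubating, remember to compile and run with --add-modules jdk.incubator.vector.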
Here's the Devoxx Belgium 2024 presentation by Alfonso and Alina.
Llama3.java + (OpenAI) REST API
Llama3.java doesn't have a REST interface, so I decided to contribute that part ❤️
I've added a Spring Boot wrapper around the core Llama3.java library, allowing developers to easily set up and run an OpenAI-compatible REST API for text generation and chat completions. The goal is to use this as the 100% Java inference engine for the IDEA DevoxxGenie plugin, allowing local inference with a complete Java solution.
Code is available on GitHub
For the time being I've copied the Llama3.java source code into my project but ideally this should be integrated as a Maven dependency.
Key Features
OpenAI-compatible API: The project implements an API that mimics OpenAI's chat completions endpoint, making it easy to integrate with existing applications (see the client sketch after this list).
Support for GGUF Models: Llama3.java can work with GGUF (GPT-Generated Unified Format) models, which are optimised for efficiency and performance.
Vector API Utilization: The project leverages Java's incubator Vector API for improved performance on matrix operations.
Cross-Platform Compatibility: While optimized for Apple Silicon (M1/M2/M3), the project can run on various platforms with the appropriate Java SDK.
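Because the wrapper mimics OpenAI's chat completions endpoint, any HTTP client will do. Here's a minimal sketch using the JDK's built-in HttpClient; the port, path and model name below are assumptions, so check your application.properties for the actual server settings:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatCompletionsClient {
    public static void main(String[] args) throws Exception {
        // OpenAI-style chat completions payload; the model name is a placeholder
        String body = """
                {
                  "model": "llama3",
                  "messages": [
                    {"role": "user", "content": "Why is Java a good fit for LLM inference?"}
                  ]
                }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/chat/completions")) // assumed defaults
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON response with the completion
    }
}
```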
Getting Started
To get started with Llama3.java, follow these steps:
Setup: Ensure you have a compatible Java SDK installed. For Apple Silicon users, an ARM64-based SDK is recommended.
Build: Use Maven to build the project with "mvn clean package".
Download a Model: Obtain a GGUF model from the Hugging Face model hub and place it in the 'models' directory.
Configure: Update the application.properties file with your model details and server settings.
Run: Start the Spring Boot application using the provided Java command.
DevoxxGenie
When the Llama3.java Spring Boot application is running, you can use DevoxxGenie for local inference 🤩
Future Directions
The next step is to move the MatMul bottleneck to the GPU using TornadoVM. The roadmap includes:
Externalise Llama3.java as a Maven dependency (if/when available)
Add GPU support using TornadoVM
GraalVM native versions 🍏
LLM sharding capabilities
Optional: Support for BitNets and Ternary Models
Conclusion
Llama3.java and JLama represent a significant step forward in bringing large language model capabilities to the Java ecosystem. By providing easy-to-use, OpenAI-compatible APIs and leveraging Java's latest performance features, these projects open up new possibilities for AI-driven applications in Java.
Whether you're building a chatbot, a content generation tool, or any application that could benefit from advanced language processing, Llama3.java and JLama offer a promising solution.
As these projects continue to evolve and optimise, they are well worth keeping an eye on for Java developers interested in the cutting edge of AI technology.
Exciting times for Java Developers! ☕️🔥❤️
-Stephan