Using LLMs to describe images
Devoxx UK 2024 photo from Flickr

I worked on face recognition many years ago, so the natural next step is to use a Large Language Model (LLM) to describe what's in the photo 🤩

With both local & cloud-based LLMs, this is actually very, VERY EASY!

First, here's the description of the above image using a local LLM:

"I see an image of a person sitting at a table with a sign that reads "WE'RE HIRING." The individual appears to be resting their head on one hand, and there are various items on the table including what looks like paperwork, possibly related to job applications or hiring. The background shows a banner with additional text that includes "highlights," "PROBLEM SOLVERS," and some contact information, suggesting this might be at a recruitment event or fair where companies are showcasing their open positions."

The LLM I used is the LLaVa multimodal model.

On my MacBook Pro it takes approximately 4 seconds to analyse an image of 2048 by 1366 px using Java 21 and LangChain4J 0.31.0.

How is it done?

3 simple Java statements 🫣

  1. Create a ChatModel backed by a local Ollama instance running the LLaVa model

  2. Create your user message with the prompt and image URL

  3. Execute the user prompt using the generate method

That's it
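Here's a minimal sketch of what those three statements can look like with LangChain4J 0.31.0 (where the chat model interface is called ChatLanguageModel). The Ollama base URL, the prompt text and the image URL are placeholders, and depending on your setup you may need to pass the image as base64 data instead of a URL:

```java
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ImageContent;
import dev.langchain4j.data.message.TextContent;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.output.Response;

public class DescribeImage {

    public static void main(String[] args) {
        // 1. Chat model backed by a local Ollama instance running the LLaVa model
        ChatLanguageModel model = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")   // default Ollama endpoint (assumption)
                .modelName("llava")
                .build();

        // 2. User message combining the prompt and the image (URL is a placeholder)
        UserMessage userMessage = UserMessage.from(
                TextContent.from("Describe what you see in this image."),
                ImageContent.from("https://example.com/devoxx-uk-2024.jpg"));

        // 3. Execute the prompt with the generate method and print the description
        Response<AiMessage> response = model.generate(userMessage);
        System.out.println(response.content().text());
    }
}
```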

GPT-4o

This is what OpenAI's GPT-4o gives via LangChain4J for the same photo:

The image shows a woman sitting at a table in front of a blue backdrop with the words "We're hiring" written in large letters. She is wearing a dark-colored shirt and holding a phone. On the table, there are several white water bottles and some informational pamphlets or cards. The backdrop also has additional text that includes words like "Highflyers," "Problem Solvers," and possibly more. The setting appears to be a job fair or recruitment event.

You do need an API key for this cloud-based model.
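Switching to GPT-4o only changes how the model is created; the user message and the generate call stay exactly the same. A minimal sketch, assuming the API key is read from an OPENAI_API_KEY environment variable:

```java
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;

public class OpenAiSetup {

    // Build a GPT-4o chat model; reuse it with the same UserMessage and generate() call as above
    static ChatLanguageModel gpt4o() {
        return OpenAiChatModel.builder()
                .apiKey(System.getenv("OPENAI_API_KEY"))  // API key taken from an environment variable (assumption)
                .modelName("gpt-4o")
                .build();
    }
}
```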

What's interesting is that when I ask ChatGPT (also using GPT-4o) I get a much more detailed description, probably because OpenAI adds some extra prompt magic in the system message?


"This photo captures a scene at a hiring or recruitment event. Here's a breakdown of the details:

  1. Setting: The backdrop prominently displays the words "We're hiring," indicating that the event is focused on recruitment. The background is blue with white text, making the message stand out clearly.

  2. Person: A woman with long blonde hair is seated, holding a smartphone. She is wearing a dark blue T-shirt with the logo and name "Auto Trader" along with a badge or lanyard that might identify her as a representative of the company.

  3. Table Setup: On the table in front of her, there are several white water bottles, flyers, and business cards. This setup suggests that the table is likely a booth where the company is distributing promotional materials and information about job opportunities.

  4. Additional Text: Below the primary "We're hiring" message, other encouraging terms like "Highflyers," "Innovators," and "Problem Solvers" are visible. These terms are likely used to attract potential candidates who identify with these qualities.

Overall, the image conveys a professional and organized recruitment effort by Auto Trader, aimed at attracting talented individuals to join their team."


Here are some more results using the local LLaVa model:

Devoxx UK 2024 photos

The next step is combining semantic search with the speaker recognition, so I can type: "Give me all photos where the Devoxx letters are visible with Venkat speaking" 🤩

Good stuff!

PS: Please let me know which other local multimodal models I should try out.
