Using LLMs to describe images
Devoxx UK 2024 photo from Flickr

I worked on face recognition many years ago, so the natural next step is to use a Large Language Model (LLM) to describe what's in the photo 🤩

With both local & cloud-based LLMs, this is actually very, VERY EASY!

First, here's the description of the above image using a local LLM:

"I see an image of a person sitting at a table with a sign that reads "WE'RE HIRING." The individual appears to be resting their head on one hand, and there are various items on the table including what looks like paperwork, possibly related to job applications or hiring. The background shows a banner with additional text that includes "highlights," "PROBLEM SOLVERS," and some contact information, suggesting this might be at a recruitment event or fair where companies are showcasing their open positions."

The LLM I used is the LLaVa multimodal model.

On my MacBook Pro it takes approximately 4 seconds to analyse an image of 2048 by 1366 px using Java 21 and LangChain4J 0.31.0.

How is it done?

3 simple Java statements 🫣

  1. Create a ChatModel backed by a local Ollama instance running the LLaVa model

  2. Create your user message with the prompt and image URL

  3. Execute the user prompt using the generate method

That's it
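Here's a minimal sketch of what those three statements can look like with LangChain4J 0.31.0 (where the chat model interface is called ChatLanguageModel). The Ollama base URL, the prompt text and the image URL are placeholders, and depending on your setup you may need to pass the image as base64 data instead of a URL:

```java
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ImageContent;
import dev.langchain4j.data.message.TextContent;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.output.Response;

public class DescribeImage {

    public static void main(String[] args) {
        // 1. Chat model backed by a local Ollama instance running the LLaVa model
        ChatLanguageModel model = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")   // default Ollama endpoint (assumption)
                .modelName("llava")
                .build();

        // 2. User message combining the prompt and the image (URL is a placeholder)
        UserMessage userMessage = UserMessage.from(
                TextContent.from("Describe what you see in this image."),
                ImageContent.from("https://example.com/devoxx-uk-2024.jpg"));

        // 3. Execute the prompt with the generate method and print the description
        Response<AiMessage> response = model.generate(userMessage);
        System.out.println(response.content().text());
    }
}
```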

GPT-4o

This is what OpenAI's GPT-4o gives via LangChain4J for the same photo:

The image shows a woman sitting at a table in front of a blue backdrop with the words "We're hiring" written in large letters. She is wearing a dark-colored shirt and holding a phone. On the table, there are several white water bottles and some informational pamphlets or cards. The backdrop also has additional text that includes words like "Highflyers," "Problem Solvers," and possibly more. The setting appears to be a job fair or recruitment event.

You do need an API key for this cloud-based model.
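Switching to GPT-4o only changes how the model is created; the user message and the generate call stay exactly the same. A minimal sketch, assuming the API key is read from an OPENAI_API_KEY environment variable:

```java
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;

public class OpenAiSetup {

    // Build a GPT-4o chat model; reuse it with the same UserMessage and generate() call as above
    static ChatLanguageModel gpt4o() {
        return OpenAiChatModel.builder()
                .apiKey(System.getenv("OPENAI_API_KEY"))  // API key taken from an environment variable (assumption)
                .modelName("gpt-4o")
                .build();
    }
}
```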

What's interesting is that when I ask ChatGPT (also using GPT-4o) I get a much more detailed description, probably because OpenAI adds some extra prompt magic in the system message?


"This photo captures a scene at a hiring or recruitment event. Here's a breakdown of the details:

  1. Setting: The backdrop prominently displays the words "We're hiring," indicating that the event is focused on recruitment. The background is blue with white text, making the message stand out clearly.

  2. Person: A woman with long blonde hair is seated, holding a smartphone. She is wearing a dark blue T-shirt with the logo and name "Auto Trader" along with a badge or lanyard that might identify her as a representative of the company.

  3. Table Setup: On the table in front of her, there are several white water bottles, flyers, and business cards. This setup suggests that the table is likely a booth where the company is distributing promotional materials and information about job opportunities.

  4. Additional Text: Below the primary "We're hiring" message, other encouraging terms like "Highflyers," "Innovators," and "Problem Solvers" are visible. These terms are likely used to attract potential candidates who identify with these qualities.

Overall, the image conveys a professional and organized recruitment effort by Auto Trader, aimed at attracting talented individuals to join their team."


Here are some more results using the local LLaVa model:

Devoxx UK 2024 photos

The next step is combining semantic search with the speaker recognition, so I can type: "Give me all photos where the Devoxx letters are visible with Venkat speaking" 🤩

Good stuff!

PS: Please let me know which other local multimodal models I should try out.
