Machine Learning Research

460 Posts

Apple AI models outperform rivals in instruction accuracy and human text evaluations across devices and servers.

Apple Sharpens Its GenAI Profile: Apple updates its on-device and cloud AI models, introduces a new developer API

Apple revamped two vision-language models in a bid to catch up with fast-moving competitors.
Diagram showing AI pipeline using OCR and LLMs to detect racist clauses in historic California property deeds.

LLM Rights Historical Wrongs: Stanford and Princeton researchers fine-tune a language model to identify racial discrimination in property deeds

In Northern California, old property deeds may still include racial clauses: language, made illegal decades ago, that was designed to ban people of color from owning or living in certain homes.
OpenAI o3-pro outperforms o3 and o1-pro on math, science, and coding benchmarks, but responds much more slowly.

More Reasoning for Harder Problems: OpenAI debuts o3-pro, an updated reasoning model that applies more tokens at inference

OpenAI launched o3-pro, a more capable version of o3, its most advanced vision-language reasoning model.
STORM pipeline overview: Mamba layers link the image encoder and LLM, adding temporal info to tokens and reducing image tokens without losing key details.

Better Video, Fewer Tokens: STORM processes fewer tokens and still beats GPT-4o on video understanding benchmarks

Researchers reduced the number of tokens needed to represent the video frames fed to a transformer.
The FLUX.1 Kontext family of image generators from Black Forest Labs edits images to remove or add objects, apply art styles, and extract details.

More Consistent Characters and Styles: Black Forest Labs Launches FLUX.1 Kontext for Generating and Altering Images with Consistent Details

Same character, new background, new action. That’s the focus of the latest text-to-image models from Germany’s Black Forest Labs.
Diagram showing how a language model agent gets misled by malicious posts and sites when searching for Nike shoes online.

Phishing for Agents: Columbia University researchers show how to trick trusting AI agents with poisoned links

Researchers identified a simple way to mislead autonomous agents based on large language models.
Bar graph comparing AI model accuracies for AIME 2024-2025, GPQA, LiveCodeBench, Aider, and Humanity's Last Exam.

Next-Level DeepSeek-R1: DeepSeek-R1’s update leads all open models and brings it up to date with the latest from Google and OpenAI

DeepSeek updated its groundbreaking DeepSeek-R1 large language model to strike another blow for open-weights performance.
DeepSeek computation diagram showing transformer blocks, multi-head attention, and routing, using FP8 and BF16 precision.

How DeepSeek Did It: Researchers describe training methods and hardware choices for DeepSeek’s V3 and R1 models

DeepSeek made headlines late last year when it built a state-of-the-art, open-weights large language model at a cost far lower than usual. The upstart developer shared new details about its methods.
Side-by-side of a fern leaf and its digital code representation, illustrating nature's pattern-to-code transformation.

Google I/O Overdrive: Google’s new AI offerings include Veo 3 video generator, lightweight Gemma 3n, updates to Gemini Pro and Ultra, and more

Google revamped its roster of models, closed and open, and added more AI-powered features to its existing products.
AI model performance comparison chart: Claude Opus 4, Sonnet 4, Sonnet 3.7, OpenAI o3, GPT-4.1, and Gemini 2.5 Pro.

Claude 4 Advances Code Generation: Anthropic debuts new Claude 4 Sonnet and Claude 4 Opus models that top coding benchmarks

Anthropic continued its tradition of building AI models that raise the bar in coding tasks.
Diagram of FP4 training scheme showing BF16 tensor quantization and FP4 tensor core processing for efficient computation.

4-Bit Efficiency, 16-Bit Accuracy: Microsoft researchers show that heavily quantized versions of Llama can perform as well as near-full-precision versions

Using an 8-bit number format like FP8 during training saves computation compared to 16- or 32-bit formats, but it can yield less-accurate results. Researchers trained models using 4-bit numbers without sacrificing accuracy.
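To illustrate the trade-off, here is a minimal sketch of low-bit quantization. It uses symmetric integer quantization rather than the floating-point FP4 format the researchers studied, and is not their method; it only shows how fewer bits shrink the representable grid and increase reconstruction error.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric per-tensor quantization to a signed integer grid.

    With bits=4, values map to integers in [-7, 7]; a per-tensor
    scale preserves dynamic range at the cost of resolution.
    """
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit, 127 for 8-bit
    scale = float(np.abs(x).max()) / qmax
    if scale == 0.0:
        scale = 1.0                          # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)

q4, s4 = quantize(x, bits=4)
q8, s8 = quantize(x, bits=8)

# Fewer bits -> coarser grid -> larger mean reconstruction error
err4 = np.abs(dequantize(q4, s4) - x).mean()
err8 = np.abs(dequantize(q8, s8) - x).mean()
```

The point of the research is closing exactly this gap: keeping the compute savings of 4-bit arithmetic while recovering near-16-bit accuracy through the training procedure.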
Chat interface discussing code error with special character filenames. Terminal shows Unix commands for troubleshooting.

Your Robot Dev Team: OpenAI introduces Codex, a multi-agent cloud-based software engineering tool in ChatGPT

OpenAI launched an agentic software-development system.
Dual line graphs showing factual QA accuracy and NLL against memory size for NQ and TQA datasets in AI models.

Memory Layers for More-Factual Output: Meta researchers build Llama-style models that recall details without needing more computing resources

Improving a large language model’s factual accuracy typically requires making it bigger, which, in turn, demands more computation. Researchers devised an architecture that enables models to recall relevant details without significantly increasing the computation required.
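A hypothetical sketch of the core idea behind a key-value memory layer: a large trainable table of (key, value) slots that the model reads sparsely, so per-token compute stays roughly flat as stored knowledge grows. This is not Meta's implementation; real memory layers also avoid scoring every key (for example, with product keys), whereas this toy version scans them all for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_slots, k = 64, 100_000, 4

# Large memory table: in a trained model, keys and values are learned
keys = rng.standard_normal((num_slots, d)).astype(np.float32)
values = rng.standard_normal((num_slots, d)).astype(np.float32)

def memory_lookup(query):
    """Read only the k best-matching slots (a sparse memory access)."""
    scores = keys @ query                     # similarity to every key
    top = np.argpartition(scores, -k)[-k:]    # indices of the k best slots
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                              # softmax over selected slots
    return w @ values[top]                    # weighted value readout

out = memory_lookup(rng.standard_normal(d).astype(np.float32))
```

Because only `k` value rows are read per query, enlarging `num_slots` adds capacity for facts without proportionally adding computation in the rest of the network.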
Comparison table of AI models ranked by LCB score and Codeforces rating with percentiles for competitive programming.

Open, Compact Code Generator: DeepCoder-14B-Preview further fine-tunes reasoning models for coding

An open-source code generator performs comparably to the reasoning models DeepSeek-R1 and OpenAI o1 despite being much smaller.
Table comparing AI model accuracy on math and reasoning benchmarks including AIME, HMMT, OmniMath, GPQA-D, and Codeforces.

Reasoning Models With Recipes: Microsoft unveils training details for Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning

Microsoft published its latest recipe for training reasoning models, substantially expanding what is still a fairly small base of public knowledge.