URL: https://dev.to/fosteman/100-years-later-apple-finally-shipped-local-multimodal-in-xcode-27-beta-nmc

6 months later: Apple Finally Shipped Local Multimodal in Xcode 27 Beta


A while ago I wrote a full llama.cpp iOS implementation on top of an Objective-C++ bridge because I wanted one thing:

image in -> structured JSON out -> no cloud required.

It worked. It was fast enough. It was also a lot of plumbing:

  • XCFramework builds
  • ObjC++ bridge
  • tokenizer/eval/sampling internals
  • model + projector file choreography
  • JSON guardrails everywhere
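To give a sense of what a "JSON guardrail" means in practice, here is a minimal sketch. It is not the original project code: the ReceiptDraft type and the decodeGuarded helper are hypothetical names made up for illustration. The idea is that a raw local model tends to wrap its JSON in chatty text, so the app has to cut out the object and validate it before trusting it.

```swift
import Foundation

// Hypothetical target type; the field names are illustrative only.
struct ReceiptDraft: Codable, Equatable {
    var vendor_name: String
    var total_amount: Double
}

/// Cuts the span from the first "{" to the last "}" out of free-form model
/// output and tries to decode it. Returns nil when no braces are found or
/// the JSON does not match the expected type.
func decodeGuarded<T: Decodable>(_ type: T.Type, from raw: String) -> T? {
    guard let start = raw.firstIndex(of: "{"),
          let end = raw.lastIndex(of: "}"),
          start < end else { return nil }
    let json = String(raw[start...end])
    return try? JSONDecoder().decode(type, from: Data(json.utf8))
}

// Made-up model output with extra text around the JSON object.
let noisy = "Sure! Here is the JSON: {\"vendor_name\": \"Cafe\", \"total_amount\": 12.5} Hope that helps."
let draft = decodeGuarded(ReceiptDraft.self, from: noisy)
let rejected = decodeGuarded(ReceiptDraft.self, from: "no json here")
```

Every call site that consumed model output needed some variant of this, plus a retry or fallback path for the nil case.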

Now, about 6 months later, Apple dropped Foundation Models image analysis in Xcode 27.0 beta, and I can finally call a serious on-device model without maintaining that whole engine room myself.

Analyzing images with multimodal prompting | Apple Developer Documentation

Analyze and extract information from images by combining them with descriptive text prompts.

developer.apple.com

With Foundation Models, the core API is basically:

import FoundationModels

// Output schema. @Generable lets the session generate this type directly.
@Generable
struct ReceiptExtraction: Codable {
    var vendor_name: String
    var transaction_date: String
    var total_amount: Double
    var currency: String
    var category: String
    var line_items: [String]
}

let session = LanguageModelSession(model: .default)

let response = try await session.respond(
    generating: ReceiptExtraction.self,
    options: GenerationOptions(
        // Top-k sampling with a fixed seed and low temperature for repeatable output.
        sampling: .random(top: 20, seed: 1111),
        temperature: 0.1,
        maximumResponseTokens: 384
    )
) {
    """
    Extract receipt information for bookkeeping.
    Return schema-compliant structured output only.
    Format fields for QuickBooks ingestion.
    """
    // cgImage is a CGImage of the receipt, loaded elsewhere.
    Attachment(cgImage, orientation: .right)
}

// response.content is a ReceiptExtraction value.
let result = response.content

Receipt image in → QuickBooks-ready JSON out.
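Strictly speaking, response.content is a typed ReceiptExtraction value rather than a JSON string. Since the struct is also Codable, getting actual JSON should be one JSONEncoder call away. A minimal sketch, assuming a plain Codable copy of the struct (without @Generable, so it compiles without FoundationModels) and made-up sample values standing in for response.content; the jsonString helper name is hypothetical:

```swift
import Foundation

// Plain Codable stand-in for the @Generable struct above.
struct ReceiptExtraction: Codable {
    var vendor_name: String
    var transaction_date: String
    var total_amount: Double
    var currency: String
    var category: String
    var line_items: [String]
}

/// Serializes a receipt to a compact JSON string with sorted keys,
/// so the output is stable across runs.
func jsonString(for receipt: ReceiptExtraction) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    let data = try encoder.encode(receipt)
    return String(decoding: data, as: UTF8.self)
}

// Made-up sample values, standing in for response.content.
let sample = ReceiptExtraction(
    vendor_name: "Corner Cafe",
    transaction_date: "2025-01-15",
    total_amount: 12.5,
    currency: "USD",
    category: "Meals",
    line_items: ["Latte", "Bagel"]
)

let json = try jsonString(for: sample)
print(json)
```

The snake_case property names are what make this work without a custom key strategy: the encoder emits the property names as-is.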

No bridge.
No gguf.
No mmproj.
No custom decode loop.

Before

  • llama.cpp vendor management
  • ObjC++ wrappers and thread safety
  • bespoke schema/prompt failover handling
  • app startup warmups with model files in bundle

Now

  • native LanguageModelSession
  • native Attachment(...) for images
  • native structured generation with @Generable
  • native prewarm and model availability checks
  • native profiling in Instruments.app
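The availability check and prewarm from the list above can be sketched roughly like this. It is a sketch under assumptions: it uses SystemLanguageModel.default.availability and LanguageModelSession.prewarm() as they appear in the Foundation Models framework that shipped with the OS 26 SDKs, and I have not verified it against the Xcode 27 beta. The onDeviceModelIsReady helper is a hypothetical name, and the canImport guard is only there so the file still compiles on toolchains without the framework.

```swift
import Foundation
#if canImport(FoundationModels)
import FoundationModels
#endif

/// Reports whether the system model says it is usable right now.
/// Returns false on platforms or OS versions without FoundationModels.
func onDeviceModelIsReady() -> Bool {
#if canImport(FoundationModels)
    if #available(iOS 26.0, macOS 26.0, visionOS 26.0, *) {
        if case .available = SystemLanguageModel.default.availability {
            return true
        }
    }
#endif
    return false
}

#if canImport(FoundationModels)
if #available(iOS 26.0, macOS 26.0, visionOS 26.0, *), onDeviceModelIsReady() {
    let session = LanguageModelSession(model: .default)
    // Hint to the framework to load model resources before the first request.
    session.prewarm()
    // In a real app, keep this session around and reuse it for respond(...).
}
#endif

print(onDeviceModelIsReady() ? "model ready" : "model unavailable")
```

Compared with bundling model files and warming them up at app startup, the app only has to ask whether the model is there and fall back gracefully when it is not.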

And that is exactly where it should have been from the day I started fiddling with multimodal inference.