VOOZH about

URL: https://dev.to/diyoraharshit52/mediapipe-on-ios-what-the-docs-leave-out-22bl

⇱ MediaPipe on iOS: what the docs leave out - DEV Community


The MediaPipe Tasks documentation covers installation, initialization, and a working demo. It does not cover what happens when you try to ship a production app. Six months of building CareSpace AI — a real-time physiotherapy app that runs pose detection at 30fps on every frame — filled in those gaps.

Here's what the docs leave out.


AVCaptureSession setup that actually works

The docs show a generic camera setup. This is the production-tuned configuration that keeps latency low and prevents dropped frames:

class CameraManager: NSObject {
 private let captureSession = AVCaptureSession()
 private let videoOutput = AVCaptureVideoDataOutput()
 private let processingQueue = DispatchQueue(label: "com.app.camera", qos: .userInitiated)

 func configure() throws {
 captureSession.beginConfiguration()

 // VGA is the MediaPipe sweet spot — higher resolution adds latency without accuracy gains
 captureSession.sessionPreset = .vga640x480

 guard let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .front),
 let input = try? AVCaptureDeviceInput(device: device) else {
 throw CameraError.deviceUnavailable
 }
 captureSession.addInput(input)

 // Lock frame rate — variable frame rate causes timestamp jitter in MediaPipe
 try device.lockForConfiguration()
 device.activeVideoMinFrameDuration = CMTime(value: 1, timescale: 30)
 device.activeVideoMaxFrameDuration = CMTime(value: 1, timescale: 30)
 device.unlockForConfiguration()

 // BGRA is what MediaPipe expects — do not use MJPEG or YUV without conversion
 videoOutput.videoSettings = [
 kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
 ]
 videoOutput.alwaysDiscardsLateVideoFrames = true
 videoOutput.setSampleBufferDelegate(self, queue: processingQueue)
 captureSession.addOutput(videoOutput)

 captureSession.commitConfiguration()
 }
}

Two things that aren't obvious: frame rate locking and pixel format. Without frame rate locking, AVFoundation's automatic frame rate control kicks in when the CPU is under load — and sends MediaPipe frames with inconsistent timestamps, which breaks temporal smoothing. Without kCVPixelFormatType_32BGRA, you're doing an implicit conversion on every frame.


GPU vs CPU delegate — it's not what you think

The docs say: "Use GPU delegate for better performance." That's true for benchmarks. In production, it's more complicated.

// GPU delegate — fast, but causes thermal issues in long sessions
let gpuOptions = PoseLandmarkerOptions()
gpuOptions.baseOptions.delegate = .GPU

// CPU delegate — slower per-frame, but thermally stable
let cpuOptions = PoseLandmarkerOptions()
cpuOptions.baseOptions.delegate = .CPU

The GPU delegate runs 20–30% faster per frame. But on an iPhone 12, after about 8 minutes of continuous use with the GPU delegate, the device throttles and frame rate drops from 30fps to 18fps. With the CPU delegate, the same session runs at a stable 28fps for 20+ minutes.

For a physiotherapy app where sessions can run 15–30 minutes, we use:

  • CPU delegate by default
  • GPU delegate only if the device is iPhone 14 Pro or newer (better thermal headroom)
let delegate: Delegate = {
 if ProcessInfo.processInfo.thermalState == .nominal &&
 UIDevice.current.modelIdentifier >= "iPhone14,2" {
 return .GPU
 }
 return .CPU
}()

Coordinate space mapping — the front camera trap

MediaPipe returns landmark coordinates in normalized space (0,0) to (1,1), relative to the camera frame. This seems straightforward until you try to overlay landmarks on a live preview with a front-facing camera.

Two transforms are required:

func convertToViewCoordinates(
 landmark: NormalizedLandmark,
 in viewBounds: CGRect,
 isFrontCamera: Bool
) -> CGPoint {
 var x = CGFloat(landmark.x)
 var y = CGFloat(landmark.y)

 // Front camera: mirror X coordinate
 // MediaPipe does NOT mirror automatically for the front camera
 if isFrontCamera {
 x = 1.0 - x
 }

 // Portrait mode: swap X and Y, then apply bounds
 // Camera captures in landscape; device is in portrait
 let rotatedX = y
 let rotatedY = 1.0 - x

 return CGPoint(
 x: rotatedX * viewBounds.width,
 y: rotatedY * viewBounds.height
 )
}

If you don't mirror the X coordinate for the front camera, the skeleton is laterally flipped. If you don't handle the portrait/landscape rotation, the skeleton is transposed to the wrong axis. Both problems are invisible in unit tests because they require a rendered UI to see.


The retain cycle that leaks your session

MediaPipe's live stream detection mode uses a callback closure (poseLandmarkerLiveStreamDelegate). If you capture self strongly in that closure, you create a retain cycle that leaks the entire detection session.

// WRONG — retain cycle
class PoseDetectionService {
 var landmarker: PoseLandmarker?

 func setup() {
 let options = PoseLandmarkerOptions()
 // This captures self strongly inside MediaPipe's internal storage
 options.poseLandmarkerLiveStreamDelegate = self
 landmarker = try? PoseLandmarker(options: options)
 }

 deinit {
 // Never called — self is retained by landmarker's delegate reference
 print("PoseDetectionService deallocated")
 }
}
// CORRECT — use a wrapper to break the cycle
class WeakDelegateWrapper: PoseLandmarkerLiveStreamDelegate {
 weak var delegate: PoseLandmarkerLiveStreamDelegate?

 func poseLandmarker(
 _ poseLandmarker: PoseLandmarker,
 didFinishDetection result: PoseLandmarkerResult?,
 timestampInMilliseconds: Int,
 error: Error?
 ) {
 delegate?.poseLandmarker(poseLandmarker, didFinishDetection: result,
 timestampInMilliseconds: timestampInMilliseconds, error: error)
 }
}

// In setup:
let wrapper = WeakDelegateWrapper()
wrapper.delegate = self
options.poseLandmarkerLiveStreamDelegate = wrapper

This pattern is standard for any delegate-based API where you don't control the delegate storage. The docs don't mention it because the retention behaviour depends on how MediaPipe stores the delegate internally.


Model selection: Lite vs Full vs Heavy

Three models, three tradeoffs:

Model Latency (iPhone 12, CPU) Landmark accuracy Visibility score quality
Lite ~18ms Good Low
Full ~35ms Better Medium
Heavy ~55ms Best High

For production:

  • Lite: Finger counting demos, casual fitness apps, situations where you only need rough body position
  • Full: Most physiotherapy and fitness use cases — good accuracy at 30fps on iPhone 12+
  • Heavy: Situations where visibility scores drive logic (like the gating in our noise-handling pipeline) — the scores are significantly more reliable

We ship Full by default and let users switch to Heavy in settings if they want higher accuracy at the cost of ~5fps.


Testing in the simulator

MediaPipe doesn't work in the simulator — the GPU delegate isn't available, and the camera isn't present. Inject a video file instead:

class SimulatorVideoSource: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
 private var displayLink: CADisplayLink?
 private var videoReader: AVAssetReader?
 private var trackOutput: AVAssetReaderTrackOutput?
 weak var delegate: AVCaptureVideoDataOutputSampleBufferDelegate?
 var queue = DispatchQueue(label: "simulator.video")

 func startPlayback(url: URL) {
 let asset = AVAsset(url: url)
 guard let track = asset.tracks(withMediaType: .video).first,
 let reader = try? AVAssetReader(asset: asset) else { return }

 let output = AVAssetReaderTrackOutput(
 track: track,
 outputSettings: [kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA]
 )
 reader.add(output)
 reader.startReading()

 videoReader = reader
 trackOutput = output

 displayLink = CADisplayLink(target: self, selector: #selector(sendNextFrame))
 displayLink?.preferredFrameRateRange = CAFrameRateRange(minimum: 30, maximum: 30, preferred: 30)
 displayLink?.add(to: .main, forMode: .default)
 }

 @objc private func sendNextFrame() {
 guard let sampleBuffer = trackOutput?.copyNextSampleBuffer() else {
 displayLink?.invalidate()
 return
 }
 delegate?.captureOutput?(AVCaptureOutput(), didOutput: sampleBuffer, from: AVCaptureConnection())
 }
}

Add a test video (recorded from a device) to your asset catalogue. In #if targetEnvironment(simulator) blocks, use SimulatorVideoSource instead of AVCaptureSession. Your whole pipeline — MediaPipe, landmark smoothing, state machine — runs end-to-end in the simulator against a known video.