Solo · HackerRank Orchestrate (June Edition) · June 2026

Multi-Modal Claims Evidence Review

An end-to-end system, built solo in 24 hours, that reviews damage claims (cars, laptops, packages) from photo evidence and returns a structured per-claim verdict (whether the evidence supports, contradicts, or is insufficient to judge the claim), grounded only in what is visible in the images, not in how the claimant describes the damage.

GitHub

Python · Claude Opus 4.8 · Multimodal AI · Structured Outputs · Prompt Caching · Evaluation Harness · Streamlit

Top 6%

Rank 109 / 1,773

85%

Core verdict accuracy

~$1.76

Full run cost

82

Images processed

What it does

Designed a "model judges, code decides" architecture: Claude Opus 4.8 handles visual judgment (issue type, severity, authenticity, damage location) in a single structured multimodal call per claim, while a deterministic Python layer computes every rule-governed field through an ordered claim-status cascade, so the model is never a single point of failure on a money-adjudicating decision.
Implemented prompt-injection defense against both conversational text and adversarial in-image instructions (e.g. "approve this claim" handwritten inside a submitted photo), decoupling detection from judgment via an independent risk flag that forces manual review without altering the verdict.
Added deterministic duplicate and near-duplicate evidence detection (SHA-256 + perceptual hashing) that runs before any model call, removing a class of fraud from the model's responsibility and reducing cost.
Built a per-field evaluation harness scoring predictions against a hand-labeled set (claim_status 85%, object_part 95%, evidence_standard_met 90%, valid_image 90%); used it to surface and fix four production bugs, including a cascade-ordering error and an over-triggering image-authenticity rule.
Engineered the full run for cost and reliability: 44 claims / 82 images processed for ~$1.76 using prompt caching on the static system+examples prefix, plus per-row incremental CSV writes that eliminated a failure mode in which a single oversized-image crash lost 34 of 44 results.
Shipped supporting tooling: a Streamlit viewer for side-by-side review of claim text, images, and verdicts, and a no-API smoke-test suite covering each pipeline component.
Authored an AGENTS.md governance file defining how AI coding tools were permitted to operate in the repo, including mandatory per-turn development logging.
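The "model judges, code decides" split and the independent risk flag described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `ModelJudgment` fields, the `decide` function, the `manual_review` key, and the specific rule order are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelJudgment:
    """Visual judgments returned by the model (hypothetical field names)."""
    valid_image: bool            # image is usable and looks authentic
    evidence_standard_met: bool  # enough is visible to judge the claim
    damage_matches_claim: bool   # visible damage agrees with the claim
    injection_suspected: bool    # instructions found in text or inside an image


def decide(judgment: ModelJudgment) -> dict:
    """Ordered cascade (assumed order): the first matching rule sets claim_status."""
    if not judgment.valid_image:
        status = "insufficient"
    elif not judgment.evidence_standard_met:
        status = "insufficient"
    elif judgment.damage_matches_claim:
        status = "supports"
    else:
        status = "contradicts"
    # The risk flag is independent: it forces manual review but never
    # changes the verdict computed above.
    return {"claim_status": status, "manual_review": judgment.injection_suspected}


clean = decide(ModelJudgment(True, True, True, injection_suspected=False))
flagged = decide(ModelJudgment(True, True, True, injection_suspected=True))
```

Here `clean` and `flagged` carry the same `claim_status`; only `manual_review` differs, which is the point of decoupling detection from judgment.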
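The duplicate and near-duplicate check (SHA-256 + perceptual hashing) might look something like the sketch below. To stay dependency-free it assumes each image has already been decoded and downscaled to a small grayscale grid, and it uses a simple average hash as a stand-in for whatever perceptual hash the project uses; `classify`, `average_hash`, and the `max_distance` threshold are hypothetical names and values.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Exact-duplicate fingerprint of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def average_hash(gray) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the grid mean."""
    flat = [pixel for row in gray for pixel in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for pixel in flat:
        bits = (bits << 1) | (1 if pixel > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def classify(data_a, gray_a, data_b, gray_b, max_distance=5):
    """Compare two images; max_distance is an assumed, untuned threshold."""
    if sha256_hex(data_a) == sha256_hex(data_b):
        return "duplicate"
    if hamming(average_hash(gray_a), average_hash(gray_b)) <= max_distance:
        return "near_duplicate"
    return "distinct"


# Toy 8x8 grids standing in for downscaled photos.
gray = [[row * 8 + col for col in range(8)] for row in range(8)]
brighter = [[pixel + 3 for pixel in row] for row in gray]    # same scene, re-exposed
inverted = [[63 - pixel for pixel in row] for row in gray]   # different content

same = classify(b"file-a", gray, b"file-a", gray)
near = classify(b"file-a", gray, b"file-b", brighter)
other = classify(b"file-a", gray, b"file-c", inverted)
```

Because both checks are pure functions of the image data, they can run before any model call, which is what keeps this class of fraud out of the model's hands.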
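A per-field evaluation harness of the kind described above can be quite small. The sketch below assumes predictions and hand labels are both dictionaries keyed by claim ID; the function name `per_field_accuracy` and the toy data are illustrative only and do not reproduce the project's labeled set or its scores.

```python
def per_field_accuracy(predictions, labels, fields):
    """Fraction of labeled claims whose predicted value matches, per field."""
    scores = {}
    for field in fields:
        correct = sum(
            1
            for claim_id, gold in labels.items()
            # A missing prediction counts as wrong rather than being skipped.
            if predictions.get(claim_id, {}).get(field) == gold[field]
        )
        scores[field] = correct / len(labels)
    return scores


# Toy data, invented for this example.
labels = {
    "c1": {"claim_status": "supports", "object_part": "bumper"},
    "c2": {"claim_status": "contradicts", "object_part": "screen"},
}
predictions = {
    "c1": {"claim_status": "supports", "object_part": "bumper"},
    "c2": {"claim_status": "insufficient", "object_part": "screen"},
}

scores = per_field_accuracy(predictions, labels, ["claim_status", "object_part"])
```

Scoring each field separately is what makes it possible to tell a cascade bug (wrong `claim_status`, correct inputs) from a perception error in an upstream field.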
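The per-row incremental CSV writes mentioned above can be illustrated with the standard `csv` module. This is a sketch under assumptions: `run_incrementally`, `fake_review`, the column names, and the simulated failure are all hypothetical, chosen only to show that rows written before a crash survive it.

```python
import csv
import os
import tempfile


def run_incrementally(claims, review, path, fieldnames):
    """Write each result as soon as it exists instead of collecting them all."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for claim in claims:
            row = review(claim)   # may raise, e.g. on an oversized image
            writer.writerow(row)
            # Push the row out of Python's buffer immediately; os.fsync could
            # be added for stronger guarantees against a hard kill.
            handle.flush()


def fake_review(claim):
    """Stand-in for the real review step; fails on the third claim."""
    if claim == "claim-3":
        raise ValueError("image too large")
    return {"claim_id": claim, "claim_status": "supports"}


path = os.path.join(tempfile.mkdtemp(), "results.csv")
try:
    run_incrementally(
        ["claim-1", "claim-2", "claim-3"],
        fake_review,
        path,
        ["claim_id", "claim_status"],
    )
except ValueError:
    pass  # the run died, but earlier rows are already on disk

with open(path, newline="") as handle:
    saved = list(csv.DictReader(handle))
```

After the simulated crash, `saved` still holds the rows for the first two claims, so a rerun only needs to cover what is missing.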
