The problem
Resume screening is expensive and noisy: a bad hire costs tens of thousands of dollars, and interviews and phone screens add up. Single-LLM classifiers are opaque and can encode bias. Teams need a way to combine multiple models, weigh explicit costs (a missed hire vs. an interview), and update decisions as new evidence arrives, without treating any one LLM as a single point of failure.
The solution
This project implements a Bayesian orchestration framework for multi-LLM resume screening. Instead of using one LLM as a classifier, we elicit likelihoods from several LLMs (e.g. GPT-4o, Claude, Gemini, Grok, DeepSeek) via contrastive prompting, aggregate them with robust statistics, and update beliefs with Bayes’ rule under explicit priors. Decisions are made by expected cost (e.g. cost of missed hire vs. cost of interview), so the system is both cost-aware and auditable.
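The aggregate-then-update step can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: it assumes each model returns a pair of likelihoods, P(resume | fit) and P(resume | not fit), and it uses the median of the log-likelihood ratios as the robust statistic. The function names `aggregate_log_lr` and `posterior_fit`, the elicited values, and the prior of 0.2 are all hypothetical.

```python
import math
from statistics import median


def aggregate_log_lr(likelihoods):
    """Return the median log-likelihood ratio across models.

    likelihoods: list of (p_resume_given_fit, p_resume_given_not_fit)
    pairs, one pair per model. The median is an assumed choice of
    robust statistic; it limits the pull of a single outlier model.
    """
    log_lrs = [math.log(p_fit / p_not) for p_fit, p_not in likelihoods]
    return median(log_lrs)


def posterior_fit(prior, log_lr):
    """Bayes' rule in log-odds form: posterior = prior log-odds + log LR."""
    prior_log_odds = math.log(prior / (1.0 - prior))
    post_log_odds = prior_log_odds + log_lr
    return 1.0 / (1.0 + math.exp(-post_log_odds))


# Hypothetical elicited likelihoods from three models; the third is an outlier.
elicited = [(0.6, 0.3), (0.5, 0.25), (0.9, 0.01)]
log_lr = aggregate_log_lr(elicited)
posterior = posterior_fit(prior=0.2, log_lr=log_lr)
print(f"posterior P(fit) = {posterior:.3f}")
```

Because the median ignores the extreme third ratio, the aggregated likelihood ratio here is 2, which moves prior odds of 1:4 to posterior odds of 1:2.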
Without Bayesian orchestration
Single-model bias, no explicit cost tradeoffs, and no principled way to combine or update beliefs as resumes are processed in sequence.
With Bayesian Hiring Agent
Multiple LLMs serve as likelihood sources, beliefs are updated coherently, and actions are chosen by expected cost. Reported experiments show roughly 34% lower total cost and a roughly 45% improvement in demographic parity relative to the best single-LLM baseline.
What it does
- Contrastive prompting – Elicit likelihoods from LLMs (e.g. “fit vs. not fit”) instead of hard labels.
- Multi-model aggregation – Combine predictions across GPT-4o, Claude, Gemini, Grok, DeepSeek with robust statistics.
- Bayesian updating – Prior + likelihood → posterior; update as each resume is processed (sequential decision-making).
- Cost-aware actions – Choose screen / interview / reject by minimizing expected cost (e.g. $40k missed hire, $2.5k interview, $150 phone screen).
- Fairness – In reported experiments, improves demographic parity by about 45% while reducing total cost by about 34% vs. the best single-LLM baseline.
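Expected-cost action selection can be illustrated with the example costs from the list above. The sketch below is an assumption-laden toy, not the project's cost model: the loss table, the 30% chance that a phone screen wrongly drops a fit candidate (`SCREEN_MISS_RATE`), and the function names are all hypothetical.

```python
# Example costs in dollars, taken from the feature list.
MISSED_HIRE = 40_000
INTERVIEW = 2_500
PHONE_SCREEN = 150

# Assumption for illustration only: a phone screen wrongly drops
# a fit candidate 30% of the time.
SCREEN_MISS_RATE = 0.3

# Hypothetical loss table: cost of each action under each true state.
LOSS = {
    "reject": {"fit": MISSED_HIRE, "not_fit": 0},
    "screen": {
        "fit": PHONE_SCREEN + SCREEN_MISS_RATE * MISSED_HIRE,
        "not_fit": PHONE_SCREEN,
    },
    "interview": {"fit": INTERVIEW, "not_fit": INTERVIEW},
}


def expected_costs(p_fit):
    """Expected cost of each action given the posterior P(fit)."""
    return {
        action: p_fit * cost["fit"] + (1.0 - p_fit) * cost["not_fit"]
        for action, cost in LOSS.items()
    }


def choose_action(p_fit):
    """Pick the action with the lowest expected cost."""
    costs = expected_costs(p_fit)
    return min(costs, key=costs.get)


for p in (0.001, 0.1, 0.5):
    print(p, choose_action(p), expected_costs(p))
```

Under these assumed numbers the decision is a pair of thresholds on the posterior: very unlikely fits are rejected, borderline candidates get a phone screen, and likely fits go straight to interview. Every expected cost is visible, which is what makes the choice auditable.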
Tech & research
Implementation and experiments for the framework described in “Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making.” The repo contains code for elicitation, aggregation, belief updating, and evaluation on resume-screening style setups.
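For the fairness side of evaluation, one common way to quantify demographic parity is the gap between the highest and lowest selection rates across groups. The sketch below assumes that definition; the metric used in the repo and paper may differ. The function names and toy data are hypothetical.

```python
from collections import defaultdict


def selection_rates(decisions):
    """Selection rate per group.

    decisions: iterable of (group, selected) pairs, where selected is a bool.
    """
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, is_selected in decisions:
        totals[group] += 1
        selected[group] += int(is_selected)
    return {group: selected[group] / totals[group] for group in totals}


def demographic_parity_gap(decisions):
    """Largest difference in selection rate between any two groups (0 = parity)."""
    rates = selection_rates(decisions)
    return max(rates.values()) - min(rates.values())


# Toy data: group A is selected 2 of 4 times, group B 1 of 4 times.
toy = [
    ("A", True), ("A", True), ("A", False), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]
gap = demographic_parity_gap(toy)
print(f"demographic parity gap = {gap:.2f}")
```

A smaller gap means selection rates are closer across groups, so an improvement in demographic parity corresponds to this number shrinking.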