The problem

Resume screening is expensive and noisy: a bad hire costs tens of thousands of dollars, while interviews and phone screens add up. Single-LLM classifiers are opaque and can encode bias. Teams need a way to combine multiple models, incorporate costs (missed hire vs. interview cost), and update decisions as new evidence arrives, without treating any one LLM as a single point of failure.

The solution

This project implements a Bayesian orchestration framework for multi-LLM resume screening. Instead of using one LLM as a classifier, we elicit likelihoods from several LLMs (e.g. GPT-4o, Claude, Gemini, Grok, DeepSeek) via contrastive prompting, aggregate them with robust statistics, and update beliefs with Bayes’ rule under explicit priors. Decisions are made by expected cost (e.g. cost of missed hire vs. cost of interview), so the system is both cost-aware and auditable.
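The expected-cost rule can be sketched in a few lines. This is a minimal illustration, not the project's exact loss model: it compares only "interview" vs. "reject" using the missed-hire and interview costs quoted later in this page ($40k and $2.5k), and the function name and threshold logic are assumptions.

```python
def choose_action(p_fit: float,
                  c_missed_hire: float = 40_000,
                  c_interview: float = 2_500) -> str:
    """Pick the action with lower expected cost given P(fit | evidence).

    Rejecting risks a missed hire with probability p_fit;
    interviewing costs c_interview regardless of outcome.
    (Two-action sketch; the full system also considers phone screens.)
    """
    expected_cost_reject = p_fit * c_missed_hire
    return "interview" if expected_cost_reject > c_interview else "reject"
```

With these illustrative costs the break-even point is p_fit = 2,500 / 40,000 = 0.0625, so even weakly promising candidates are worth interviewing.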

Without Bayesian orchestration

Single-model bias, no explicit cost tradeoffs, and no principled way to combine or update beliefs as resumes are processed in sequence.

With Bayesian Hiring Agent

Multiple LLMs as likelihood sources; coherent belief updating; expected-cost action selection; reported ~34% cost reduction and ~45% improvement in demographic parity in experiments.

What it does

  • Contrastive prompting – Elicit likelihoods from LLMs (e.g. “fit vs. not fit”) instead of hard labels.
  • Multi-model aggregation – Combine predictions across GPT-4o, Claude, Gemini, Grok, DeepSeek with robust statistics.
  • Bayesian updating – Prior + likelihood → posterior; update as each resume is processed (sequential decision-making).
  • Cost-aware actions – Choose screen / interview / reject by minimizing expected cost (e.g. $40k missed hire, $2.5k interview, $150 phone screen).
  • Fairness – In reported experiments, improves demographic parity (45%) while reducing total cost (34%) vs. best single-LLM baseline.
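The elicit → aggregate → update pipeline above can be sketched as follows. The per-model likelihood values, and the choice of a median as the robust aggregator, are illustrative assumptions; the paper's actual estimator may differ.

```python
from statistics import median

def posterior_fit(prior: float, elicitations: list[tuple[float, float]]) -> float:
    """Combine per-model contrastive likelihoods and apply Bayes' rule.

    Each elicitation is a pair (P(evidence | fit), P(evidence | not fit))
    reported by one LLM. Per-model likelihood ratios are aggregated with
    a median, which tolerates a single badly calibrated model.
    """
    ratios = [p_fit / p_not_fit for p_fit, p_not_fit in elicitations]
    lr = median(ratios)                    # robust aggregate likelihood ratio
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * lr            # Bayes' rule in odds form
    return post_odds / (1 + post_odds)

# Five hypothetical models (e.g. GPT-4o, Claude, Gemini, Grok, DeepSeek),
# one of which is an outlier that the median discounts:
elicitations = [(0.80, 0.30), (0.70, 0.40), (0.75, 0.35),
                (0.05, 0.90), (0.80, 0.25)]
p = posterior_fit(prior=0.1, elicitations=elicitations)
```

In sequential use, the posterior from one step becomes the prior for the next piece of evidence, which is what makes the belief updating coherent across a stream of resumes.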

Tech & research

Implementation and experiments for the framework described in “Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making.” The repo contains code for likelihood elicitation, aggregation, belief updating, and evaluation on resume-screening setups.

Next steps

Roadmap & ideas

  • Public demo or API (optionally password-protected) for recruiters or researchers.
  • Support for more LLMs and prompt templates; ablation on contrastive vs. direct prompting.
  • Calibration and fairness audits on new datasets and domains.
  • Integration with ATS or HR workflows (e.g. webhook, CSV in/out) for pilot use.
  • Publication or extended write-up with full experimental details and reproducibility.