Defensibility Index methodology

How the ArrowISE Defensibility Index is computed from arrangement state and physician status. Plain-English equivalent of the computeDefensibilityIndex function.

v1.0 · reviewed 2026-05-06

Why we don’t ask an LLM to interpret Stark Law

The Defensibility Index is computed by a deterministic rule engine. Given the same arrangement inputs evaluated against the same regulatory rules, the engine produces the same score with the same audit trail every time. Everything else in this methodology rests on that architectural choice, so it is stated explicitly here rather than buried in implementation notes.

The architectural choice

There are two general patterns available for compliance software in 2026. The first is large language model (LLM) document analysis: upload an agreement, ask the model to assess it against Stark Law or AKS criteria, receive a generated answer with citations and a confidence score. The second is deterministic rule engine validation: encode the regulatory elements as structured rules, validate arrangement data against each element, produce a pass / fail / indeterminate result for every element with a reproducible audit trail.

ArrowISE chose the second pattern. The reasons are specific, and they matter to a compliance officer who stakes personal professional liability on documentation that has to survive DOJ or OIG review.

Where we use LLM technology (bounded scope)

ArrowISE uses LLM technology in exactly one place: extracting the relevant text fields from third-party FMV opinion PDFs at the point of upload. Compensation amount, opinion date, methodology summary, opining party, arrangement scope — these are well-defined fields with limited interpretation latitude, and an LLM is good at locating them inside a free-form PDF. The extracted values are then written into structured database fields where the rule engine evaluates them.

The LLM does not decide whether the arrangement is compliant. It does not interpret Stark Law exceptions. It does not generate the Defensibility Index score. It does not produce regulatory analysis. It extracts text from PDFs. That is the entire scope of LLM usage in the platform.

Where we don’t use LLM technology

The rule engine that validates each safe harbor and exception element is deterministic code. The Defensibility Index scoring is deterministic code. The hash-chained audit trail is deterministic cryptography. The enforcement-pattern weights derived from DOJ prosecution history are static coefficients, reviewed and updated through a documented version process; they are not generated answers.

This means that the same arrangement evaluated today produces the same Defensibility Index as the same arrangement evaluated three years from today, assuming the regulatory rule version hasn’t changed. If the rules change, the version bump is documented and the prior version remains available for retrospective evaluation. There is no model drift, no non-determinism, no “the AI gave a different answer this time” failure mode.

Why this matters for CIA-period scrutiny

A Corporate Integrity Agreement typically runs five years. During that period, the institution’s arrangement documentation may be reviewed by an Independent Review Organization, by the Office of Inspector General, and potentially in litigation discovery. The defensibility argument the compliance officer needs to make is not “our AI assessed this as compliant.” The defensibility argument is “the deterministic rule engine validated each of the seven personal services safe harbor elements, here is the per-element evidence, here is the hash-chained audit log proving the evaluation occurred on this date with these inputs.”

Those two arguments are different in kind. An LLM-generated assessment is a statement about what the model produced given an input at a moment in time. A deterministic rule engine evaluation is a statement about whether the arrangement, as documented, satisfies the elements required by the relevant safe harbor or exception. The latter is what the regulation asks. The former is one vendor’s synthesis of what they think the regulation asks.

Over a five-year CIA period, reproducibility is the central asset. The deterministic architecture is what makes reproducibility possible.

A note on AI and compliance

This is not anti-AI positioning. AI has real and valuable applications in compliance: document classification, anomaly detection in claims data, natural-language search across policy libraries, summarization of long enforcement actions for board reporting. ArrowISE uses LLM technology where it’s genuinely useful and bounded appropriately. The discipline is in the scope, not in the rejection.

What ArrowISE will not do is ask an LLM to make the call on whether an arrangement satisfies Stark Law on a billion-dollar liability question. That call belongs to the compliance officer and legal counsel, supported by deterministic infrastructure that produces the documentation those professionals need to defend the institution. The platform supports the decision. The professionals make it.

What the Defensibility Index measures

The Defensibility Index (DI) is a 0–100 score representing how well an arrangement would hold up under enforcement scrutiny. Higher scores mean the arrangement has stronger documentation, fresher third-party validation, and a weaker pattern match against historical Stark / Anti-Kickback Statute enforcement actions.

DI is not a legal opinion. It is a structured, deterministic aggregation of five evidence dimensions weighted to mirror what enforcement actions actually turn on. The score is a starting point for compliance review — not the conclusion of one. All decisions must be reviewed by qualified healthcare counsel.

The five weights

Component	Weight	Why this weight
FMV Currency	30%	Opinion freshness is the most-litigated dimension; stale FMV opinions are the single largest driver of Stark False Claims Act recoveries.
Safe-Harbor Completeness	25%	Element gaps under the applicable Stark exception or AKS safe harbor are direct enforcement signals; partial compliance is not compliance.
OIG Exclusion Status	25%	OIG exclusion of a referring physician is an immediate disqualifier; failure to screen is itself a finding.
External Assessment Currency	10%	Independent third-party review (Cognitron and similar) demonstrates that the arrangement was inspected by someone outside the contracting parties.
Schena-Shield Score	10%	Pattern-match against historical enforcement actions (see /enforcement — coming soon). Inverted: high-pattern-match risk reduces defensibility.

Weights sum to 100%. Sub-scores are 0–100. Each component's contribution is sub-score × weight; the DI is the sum, rounded to 0.1.
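The composite step can be sketched in TypeScript as below. This is a minimal illustration, not the production computeDefensibilityIndex: the component keys and the function name composeDefensibilityIndex are hypothetical, and it assumes that "rounded to 0.1" means round-half-up (JavaScript's Math.round). Weights are held as integer percentages so the weighted sum is divided only once at the end, which reduces the chance that floating-point error tips an exact tie, such as the worked example's 74.35, the wrong way.

```typescript
// Hypothetical component keys; the production names in
// lib/services/defensibility.ts may differ.
type Component =
  | "fmvCurrency"
  | "safeHarborCompleteness"
  | "oigExclusionStatus"
  | "externalAssessmentCurrency"
  | "schenaShield";

// v1.0 weights as integer percentages (they sum to exactly 100).
const WEIGHT_PCT: Record<Component, number> = {
  fmvCurrency: 30,
  safeHarborCompleteness: 25,
  oigExclusionStatus: 25,
  externalAssessmentCurrency: 10,
  schenaShield: 10,
};

// Combine five 0–100 sub-scores into a DI rounded to 0.1.
function composeDefensibilityIndex(subScores: Record<Component, number>): number {
  let total = 0; // running sum of sub-score × weight, in score × percent units
  for (const key of Object.keys(WEIGHT_PCT) as Component[]) {
    const score = subScores[key];
    // Reject NaN and out-of-range inputs instead of clamping them (a sketch choice).
    if (!(score >= 0 && score <= 100)) {
      throw new RangeError(`${key} sub-score must be 0–100, got ${score}`);
    }
    total += score * WEIGHT_PCT[key];
  }
  // DI = total / 100, so rounding DI to 0.1 is rounding total / 10 to an integer.
  // Math.round rounds halves up; that tie rule is an assumption of this sketch.
  return Math.round(total / 10) / 10;
}
```

Passing the worked example's sub-scores (80, 71.4, 100, 0, 75) returns 74.4. Rejecting out-of-range sub-scores rather than clamping them is a choice made for this sketch.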

Sub-score calculations

FMV Currency (0–100)

Based on days until the FMV opinion expires:

days > 90 → 100
days 31–90 → linear from 100 (at 90) down to 60 (at 30): 60 + ((days − 30) / 60) × 40
days 1–30 → linear from 60 (at 30) down to 30 (at 1): 30 + ((days − 1) / 29) × 30
expired or missing → 0
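A sketch of the FMV Currency bands in TypeScript follows. It anchors the 31–90 band at 60 points for 30 days and 100 points for 90 days, the reading that reproduces the worked example's 80 for an opinion 60 days from expiry. The function name fmvCurrencySubScore is hypothetical, and the sketch assumes whole days, with null standing for "no opinion on file" and 0 or fewer days treated as expired.

```typescript
// Hypothetical helper: daysUntilExpiry is a whole number of days,
// null when no FMV opinion is on file, and <= 0 once the opinion has expired.
function fmvCurrencySubScore(daysUntilExpiry: number | null): number {
  if (daysUntilExpiry === null || daysUntilExpiry <= 0) {
    return 0; // expired or missing
  }
  if (daysUntilExpiry > 90) {
    return 100;
  }
  if (daysUntilExpiry > 30) {
    // 31–90 days: on the line through (30, 60) and (90, 100)
    return 60 + ((daysUntilExpiry - 30) / 60) * 40;
  }
  // 1–30 days: on the line through (1, 30) and (30, 60)
  return 30 + ((daysUntilExpiry - 1) / 29) * 30;
}
```

Anchoring both linear bands at 60 points for 30 days keeps the score continuous across the 30/31-day boundary, so an opinion never gains or loses points abruptly as it crosses from one band into the next.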

Safe-Harbor Completeness (0–100)

Direct passthrough of the percentage of safe-harbor elements marked "met" for the arrangement's applicable Stark exception or AKS safe harbor. Computed in lib/services/safe-harbor.ts. An arrangement with 5 of 7 elements satisfied has a sub-score of 71.4.
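The passthrough can be sketched as a count over per-element results. The ElementStatus union and safeHarborSubScore are hypothetical stand-ins for whatever lib/services/safe-harbor.ts actually uses. The sketch assumes that an indeterminate element counts against completeness, since only elements marked "met" are counted, and that an arrangement with no evaluated elements scores 0.

```typescript
// Hypothetical per-element outcome, mirroring the pass / fail / indeterminate
// results the rule engine produces for each safe-harbor or exception element.
type ElementStatus = "met" | "not_met" | "indeterminate";

// Percentage of elements marked "met"; anything else counts against completeness.
function safeHarborSubScore(elements: readonly ElementStatus[]): number {
  if (elements.length === 0) {
    return 0; // assumption: no evaluated elements means no demonstrated completeness
  }
  const met = elements.filter((status) => status === "met").length;
  return (met / elements.length) * 100;
}
```

For five met elements out of seven it returns 500/7, about 71.43, which this page displays as 71.4.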

OIG Exclusion Status (0–100)

screened & clear → 100
screened & pending → 50
screened & match → 0
not screened → 0

External Assessment Currency (0–100)

no assessments on file → 0
any assessment status = current → 100
any assessment status = expiring → 60
all assessments expired/unavailable → 20
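Because a mixed set of assessments can match more than one rule, the sketch below checks the rules in the order listed, so any current assessment wins over an expiring one. That precedence is an assumption drawn from the order of the list; the AssessmentStatus union and externalAssessmentSubScore are hypothetical names.

```typescript
// Hypothetical status values for assessments on file.
type AssessmentStatus = "current" | "expiring" | "expired" | "unavailable";

// Rules are checked in the listed order, so the best status on file wins.
function externalAssessmentSubScore(statuses: readonly AssessmentStatus[]): number {
  if (statuses.length === 0) return 0; // no assessments on file
  if (statuses.some((s) => s === "current")) return 100; // at least one current
  if (statuses.some((s) => s === "expiring")) return 60; // none current, at least one expiring
  return 20; // every assessment on file is expired or unavailable
}
```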

Schena-Shield Score (0–100)

Schena-Shield is a 0–100 risk pattern-match score where higher = more risk. For DI purposes the score is inverted: defensibility = 100 − risk. A Schena-Shield score of 25 contributes a sub-score of 75.

Worked example

Arrangement state:

FMV opinion expires in 60 days
5 of 7 safe-harbor elements met (71.4% completeness)
OIG screening result: clear
No external assessment on file
Schena-Shield risk score: 25

Sub-scores:

FMV Currency: 60 + ((60 − 30) / 60) × 40 = 80
Safe-Harbor Completeness: 71.4
OIG Exclusion Status: 100
External Assessment Currency: 0
Schena-Shield: 100 − 25 = 75

Weighted contributions:

FMV: 80 × 0.30 = 24.00
Safe Harbor: 71.4 × 0.25 = 17.85
OIG: 100 × 0.25 = 25.00
External: 0 × 0.10 = 0.00
Schena-Shield: 75 × 0.10 = 7.50

DI = 24.00 + 17.85 + 25.00 + 0.00 + 7.50 = 74.35, which rounds to 74.4

Why the score is interpretable, not authoritative

Two arrangements with identical DI scores can have different underlying risk profiles. The score's structure is its value: a DI of 60 with a 0 in OIG is structurally different from a DI of 60 with a 0 in External Assessment Currency. The arrangement Risk tab surfaces each component's contribution so the reviewer sees which dimensions are dragging the score down.

The DI is designed to make compliance gaps visible at a glance, not to substitute for legal judgment. A score of 95 does not immunize an arrangement from enforcement; a score of 50 is not an automatic disqualification. Use the DI as a triage signal: which arrangements need attention this week, and which are routine.

Change control

Weights are board-locked. Any change to the weights, the sub-score formulas, or the rounding behavior requires written sign-off from the Chief Compliance Officer of ArrowISE (Ectropy Solutions, LLC). The defensibility-version.test.ts snapshot test will fail in CI when the weights change without this page's defensibility-version meta tag being bumped — the document and the code stay in lockstep.
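The lockstep idea can be illustrated with a framework-free check. Everything here is hypothetical: the real defensibility-version.test.ts is a snapshot test whose internals are not shown here, and this sketch takes the version string and weights as plain inputs rather than reading the defensibility-version meta tag. It fails only when the weights change while the version string stays the same; it does not cover changes to sub-score formulas or rounding.

```typescript
// Hypothetical snapshot record: a version label plus the weights it governs.
interface WeightSnapshot {
  version: string;
  weights: Record<string, number>;
}

// Stable serialization: sorting keys means property order cannot mask or fake a change.
function fingerprint(weights: Record<string, number>): string {
  return Object.keys(weights)
    .sort()
    .map((k) => `${k}=${weights[k]}`)
    .join(";");
}

// Passes when weights are unchanged, or when they changed together with a version bump.
function assertVersionLock(pinned: WeightSnapshot, current: WeightSnapshot): void {
  const changed = fingerprint(pinned.weights) !== fingerprint(current.weights);
  if (changed && pinned.version === current.version) {
    throw new Error(
      `Weights changed but defensibility-version is still ${current.version}; bump the version.`
    );
  }
}
```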

The current version is v1.0, set May 2026 at the launch of the closed-pilot program. The next planned review is Q4 2026, aligned with the SOC 2 Type I auditor selection.

Source

Production source is in lib/services/defensibility.ts. Unit-test coverage in __tests__/defensibility.test.ts pins each sub-score function and the composite output.