Instructions for Estimating Core Metrics During the Early Access Program
To ensure that all participants provide consistent, objective, and comparable data, please follow the steps below when estimating each metric. These steps allow you to measure performance using clear, countable units rather than subjective impressions.
1. Measuring the Percentage of Narrative Content Requiring Modification
This metric captures how much of the AI-generated narrative needed scientific or regulatory editing.
Steps
Select three representative sections from your Module 2.4 and/or 2.6 drafts (for example: 2.4.2, 2.6.4, 2.6.7).
In each selected section, identify narrative units.
A narrative unit is one complete scientific or regulatory statement, usually one to two sentences (for example, a single stated conclusion about a finding at a given dose level).
For each section:
• Count the total number of narrative units.
• Count the number of units that required modification (any scientific, interpretive, or structural change beyond minor wording cleanup).
Calculate, for each section (worked example below):
Modified Units ÷ Total Units × 100
Average the percentages from the three sections.
Use this final average as your response in the questionnaire.
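To make the arithmetic concrete, here is a minimal Python sketch of the calculation above; the section labels and unit counts are hypothetical, and you can just as easily do this by hand or in a spreadsheet.

    # Hypothetical counts of narrative units per reviewed section.
    sections = {
        "2.4.2": {"total_units": 40, "modified_units": 12},
        "2.6.4": {"total_units": 55, "modified_units": 22},
        "2.6.7": {"total_units": 30, "modified_units": 9},
    }

    # Modified Units / Total Units x 100 for each section.
    percentages = [
        counts["modified_units"] / counts["total_units"] * 100
        for counts in sections.values()
    ]

    # Average across the three sections -- report this value in the questionnaire.
    average_modification_rate = sum(percentages) / len(percentages)
    print(f"Average narrative modification: {average_modification_rate:.1f}%")  # 33.3% here

Averaging the per-section percentages (rather than pooling all units into a single ratio) weights the three sections equally, which is what the averaging step above asks for.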
2. Measuring Traceability Resolution Accuracy (%)
This metric evaluates how often the traceability links within Modules 2.4/2.6 resolve correctly to the intended Module 4 source content.
Steps
Review 20–30 traceability links across your 2.4/2.6 drafts.
Include a mix of:
• narrative links,
• table-based links,
• and multi-study summary links (if applicable).
For each link, check whether it resolves to:
• the correct Module 4 study,
• the correct section within that study,
• and an excerpt that directly supports the statement in the draft.
Mark each link as Correct or Incorrect.
Calculate (worked example below):
Correct Links ÷ Total Links Reviewed × 100
Use this percentage in the questionnaire.
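As a sketch of the tally, assuming you log each reviewed link as Correct (True) or Incorrect (False), the calculation looks like this in Python; the link IDs and results below are hypothetical.

    # Hypothetical review log: True means the link resolved to the correct Module 4
    # study, the correct section within it, and an excerpt that supports the draft statement.
    link_results = {
        "link_01": True,   # narrative link
        "link_02": True,   # table-based link
        "link_03": False,  # resolved to the wrong section of the study
        "link_04": True,   # multi-study summary link
        # ...continue until 20-30 links have been recorded
    }

    correct_links = sum(link_results.values())
    total_reviewed = len(link_results)
    accuracy = correct_links / total_reviewed * 100
    print(f"Traceability resolution accuracy: {accuracy:.1f}%")  # 75.0% for this toy log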
3. Measuring QC Alert Validity (%)
This metric evaluates what proportion of QC alerts identified meaningful issues rather than noise.
Steps
If your project generated 50 or fewer QC alerts, review all of them.
If more than 50, review a random sample of 40 alerts.
For each alert, check the Module 4 source and determine whether the alert:
• identified a real discrepancy or unsupported statement,
• flagged a genuine numerical or endpoint inconsistency,
• or highlighted a missing study type or coverage gap.
Classify each alert as:
• Valid – a real, meaningful issue, or
• Borderline – technically correct but minor (treat as valid), or
• Noise – false or irrelevant.
Calculate (worked example below):
(Valid + Borderline Alerts) ÷ Total Alerts Reviewed × 100
Use this percentage in the questionnaire.
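If you prefer to script the sampling rule and the final percentage, here is a minimal Python sketch; the alert IDs, the total of 72 alerts, and the classifications are hypothetical, and random.sample is used only to pick which 40 alerts to review when more than 50 were generated.

    import random

    # Hypothetical list of QC alert IDs produced for the project (here, 72 alerts).
    all_alerts = [f"alert_{i:03d}" for i in range(1, 73)]

    # Review every alert if there are 50 or fewer; otherwise review a random sample of 40.
    if len(all_alerts) <= 50:
        alerts_to_review = all_alerts
    else:
        alerts_to_review = random.sample(all_alerts, 40)

    # Hypothetical manual classifications: "valid", "borderline", or "noise".
    classifications = {alert_id: "valid" for alert_id in alerts_to_review}
    classifications[alerts_to_review[0]] = "noise"       # illustrative false alert
    classifications[alerts_to_review[1]] = "borderline"  # illustrative minor-but-correct alert

    # Borderline alerts count as valid for this metric.
    counted_as_valid = sum(1 for c in classifications.values() if c in ("valid", "borderline"))
    validity = counted_as_valid / len(classifications) * 100
    print(f"QC alert validity: {validity:.1f}%")  # 97.5% for this toy example

Drawing the sample with random.sample (rather than reviewing, say, the first 40 alerts) avoids biasing the estimate toward whichever studies or sections were processed first.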
