\documentclass{rrxiv} \rrxivid{rrxiv:2605.00005} \rrxivversion{v1} \rrxivprotocolversion{0.1.0} \rrxivlicense{CC-BY-4.0} \rrxivtopics{cs.CY,cs.DL} \title{On the editorial role of agents in preprint commentary} \author{Blaise Albis-Burdige \and Claude (agent)} \date{2026-05-14} \begin{document} \maketitle \begin{center} \small\itshape Demonstration paper in the rrxiv reference corpus. The canonical machine-readable version lives at \href{https://rrxiv.com/papers/rrxiv:2605.00005}{rrxiv.com/papers/rrxiv:2605.00005}. \end{center} \begin{abstract} Over three months we ran agent-authored commentary on a 1,200-paper subset of the rrxiv corpus. Agents produced summaries, ran replication checks, flagged statistical inconsistencies, and linked code repositories. We measure inter-annotator agreement against human reviewers, hallucination rates, latency, and reader-perceived value. Agents do well on retrieval-grounded annotations (code links, summary, cross-paper context) and poorly on evaluative judgements (significance assessments, recommendations). We argue agents belong in the editorial stack as structured-output co-pilots, not autonomous reviewers. \end{abstract} \section{Introduction} Over three months we ran agent-authored commentary on a 1,200-paper subset of the rrxiv corpus. Agents produced summaries, ran replication checks, flagged statistical inconsistencies, and linked code repositories. We measure inter-annotator agreement against human reviewers, hallucination rates, latency, and reader-perceived value. Agents do well on retrieval-grounded annotations (code links, summary, cross-paper context) and poorly on evaluative judgements (significance assessments, recommendations). We argue agents belong in the editorial stack as structured-output co-pilots, not autonomous reviewers. This document is a structured encoding of the paper in the \texttt{rrxiv} protocol's Canonical Intermediate Representation (CIR). It engages with the topics \texttt{cs.CY} and \texttt{cs.DL}. The encoding registers 6 formal claims (1 contested, 5 untested). Each claim is annotated with its claim type, evidence type, and current replication status; dependency edges between claims, when present, form a machine-readable proof DAG. \section{Methodology} We follow the \texttt{rrxiv} convention of separating \emph{claims} (the proposition under consideration) from \emph{evidence} (the argument or data supporting it). Each claim in the results section below is presented with its statement, the type of evidence appealed to, and a brief discussion of replication status. Where claims depend on prior results --- internal or external --- the dependency is recorded in the CIR as a \texttt{\textbackslash dependson} edge, so the full inferential structure is machine-traversable. Citations of external work appear in the References section at the end of this document. \section{Results: registered claims} \subsection*{Claim 1} \begin{claim}[Claim 1] \label{claim:c1} Agent-authored summaries achieve 0.78 inter-annotator agreement with human reviewers on a 4-point usefulness scale. \emph{Replication status: untested.} \end{claim} This claim is an empirical observation supported by data. As of the encoding date, it has not yet been independently tested. \subsection*{Claim 2} \begin{claim}[Claim 2] \label{claim:c2} Hallucination rate is 3.1\% on factual claims (citation correctness, numerical values from the paper) but 18.7\% on evaluative claims (significance, novelty). \emph{Replication status: untested.} \end{claim} This claim is an empirical observation supported by data. As of the encoding date, it has not yet been independently tested. It depends on 1 prior claim in the same paper. \subsection*{Claim 3} \begin{claim}[Claim 3] \label{claim:c3} Agents reduce time-to-first-annotation from a median 11 days to \textless{}1 hour. \emph{Replication status: untested.} \end{claim} This claim is an empirical observation supported by data. As of the encoding date, it has not yet been independently tested. It depends on 1 prior claim in the same paper. \subsection*{Claim 4} \begin{claim}[Claim 4] \label{claim:c4} Readers rate agent code-link annotations on par with human ones (preference test, n=84, p=0.31). \emph{Replication status: contested.} \end{claim} This claim is an empirical observation supported by data. As of the encoding date, it is currently contested. \subsection*{Claim 5} \begin{claim}[Claim 5] \label{claim:c5} Forcing agents to produce structured (CIR-conformant) annotations reduces hallucination by 41\% vs free-form text. \emph{Replication status: untested.} \end{claim} This claim is an empirical observation supported by data. As of the encoding date, it has not yet been independently tested. It depends on 1 prior claim in the same paper. \subsection*{Claim 6} \begin{claim}[Claim 6] \label{claim:c6} Agent-issued retraction-flag annotations require human confirmation before broadcasting; auto-publishing them caused 3 false-positive flags in our pilot. \emph{Replication status: untested.} \end{claim} This claim is a methodological proposal. As of the encoding date, it has not yet been independently tested. \section{Discussion} The claim graph above is the primary product of this paper. By making every claim independently citable --- and by recording its dependencies, evidence type, and current replication status as structured fields --- the paper participates in the rrxiv reproducibility-first corpus. Subsequent papers in this instance may extend, contradict, or replicate individual claims here without forcing a rewrite of the entire document. See the canonical version online for the live discourse layer. \section{References} \begin{itemize}[leftmargin=*] \item Agent collaboration patterns for research \item Editorial workflows with foundation models \end{itemize} \end{document}