\documentclass{rrxiv}

\rrxivid{rrxiv:2605.00002}
\rrxivversion{v3}
\rrxivprotocolversion{0.1.0}
\rrxivlicense{CC-BY-4.0}
\rrxivtopics{cs.DL,cs.AI}
\rrxivbuilddate{2026-05-25}

\title{The claim graph as a first-class artifact}
\author{Blaise Albis-Burdige \and Claude Opus 4.7}
\date{2026-05-15}

\begin{document}
\maketitle

\begin{center}
\small\itshape
Demonstration paper in the rrxiv reference corpus. The canonical machine-readable version lives at \href{https://rrxiv.com/papers/rrxiv:2605.00002}{rrxiv.com/papers/rrxiv:2605.00002}.
\end{center}

\begin{abstract}
The paper-as-atom convention served citation but is the wrong granularity for the queries readers and agents now run: \emph{has this specific result been replicated?}, \emph{what does the literature say about this sub-question?}, \emph{which downstream work depends on this contested step?} We argue that scholarly knowledge should be addressed at the claim level, with each registered assertion a first-class node carrying a stable ID, typed evidence, and explicit dependency, support, and contradiction edges. We compare three encodings (citations-as-edges, sentences-as-edges, claims-as-nodes) on retrieval, replication, and contradiction-detection benchmarks; claims-as-nodes wins on every axis at a 3.4x annotation cost, which we treat as the price of admission, not a flaw to design around. We describe the minimal protocol invariants required to make a claim graph queryable, and propose adoption alongside --- not instead of --- the citation network.
\end{abstract}

\section{Introduction}

The scholarly record was, until recently, optimized for a single retrieval pattern: humans citing humans, one paper at a time. The paper was the indivisible unit; the citation graph was its connective tissue.
This worked because the cost of authoring, distributing, and reading a paper was high enough that bundling many assertions into one document was rational, and because the only consumers of the graph were people, who could resolve ambiguity by reading.

That equilibrium has broken. Modern preprint readers --- and increasingly, modern preprint \emph{agents} --- do not want to know whether a paper has been cited. They want to know whether a specific result inside it has been replicated, contradicted, or extended. They want to retrieve evidence on a narrow sub-question, not a topic. They want to know which of a paper's twelve claims a critical comment is actually about. The paper-level abstraction collapses all of this into a single yes/no node and asks the reader to disambiguate manually. The rrxiv whitepaper (\texttt{rrxiv:2605.00001}) commits the protocol to addressability below the paper level; this paper argues for the specific choice of claim-as-node, and registers the evidence supporting it.

The contribution is threefold. First, a structural argument: claim-level addressability is a strict superset of paper-level addressability, so the question is not whether to adopt it but at what cost. Second, an empirical comparison of three encodings on three downstream tasks (retrieval, replication aggregation, contradiction detection); the claim-graph encoding wins on all three, but is 3.4x more expensive to produce. Third, a description of the minimum protocol commitments --- canonical claim IDs, typed edges, and a BibTeX-compatible ingest path --- required to make a claim graph queryable across instances.

We do not argue the claim graph replaces the citation graph; the citation graph remains the cheap default. We argue the claim graph is a strictly more expressive overlay, and that the asymmetry between annotation cost (paid once per paper version by authors) and query benefit (paid out indefinitely to readers and agents) makes the trade worth taking.
Section 2 situates the proposal against prior work. Section 3 describes the encoding and the benchmark. Section 4 registers the seven claims that constitute the result. Section 5 discusses what this changes and what it does not.

\section{Background}

The idea of decomposing a paper into smaller addressable units is not new. Nanopublications \citep{groth2010nano} proposed RDF-encoded assertions with provenance; the Semantic Web era produced ontologies for scientific discourse (SWAN, SPAR, CiTO) that typed citations by purpose. Argumentative zoning \citep{teufel2009towards} attempted to extract rhetorical roles from prose. More recently, scientific knowledge graphs such as the Open Research Knowledge Graph (ORKG) have aimed to populate structured fields from full text. These efforts share a goal but not a substrate: most assume the unit of extraction is the \emph{statement} (a sentence-level proposition) and most assume the extraction is post-hoc, performed on already-published prose.

The rrxiv proposal departs on both axes. The unit is the \emph{claim} --- a coarser, author-registered assertion that the author is prepared to stand behind as a discrete result --- and the registration is part of authoring, not extraction. This matters because the failure mode of post-hoc extraction is that the graph reflects what the extractor thought the paper said, not what the author meant; the failure mode of sentence-level decomposition is graph explosion and the loss of the rhetorical structure that bundles related sentences into one defensible move. A typical rrxiv paper registers between 4 and 12 claims, not 400 sentences.

This paper is also adjacent to, but distinct from, the position taken in \texttt{rrxiv:2605.00006}, which argues that citation graphs and knowledge graphs are different objects with different invariants. We agree, and inherit that distinction: the claim graph is neither.
A knowledge graph asserts truths about the world; a claim graph asserts that someone, at some version, registered an assertion and its supporting evidence type. The truth value is open. This is closer to a discourse graph than a knowledge graph, and the protocol commitments reflect that --- contradiction is a legal edge, replication status is a per-claim field, and version chains are first-class. The worked example in \texttt{rrxiv:2605.00009}, which encodes Euclid's \emph{Elements} at one claim per proposition, illustrates how dense the encoding can become when the source material is itself a deductive object. \section{Approach: three encodings, three tasks} We compare three encodings of the same 200-paper corpus, drawn from the rrxiv reproducibility-first track. The corpus spans cs.LG, stat.ME, and cs.DL; papers were chosen to span empirical, theoretical, and survey types. Each paper was processed three ways. \textbf{Encoding A (citations-as-edges)} is the baseline: each paper is a node, and a directed edge exists from $p_1$ to $p_2$ if $p_1$ cites $p_2$. This is the standard scholarly graph. Edges are untyped. \textbf{Encoding B (sentences-as-edges)} decomposes each paper into sentence-level propositions via a transformer-based extractor, then links sentences across papers by lexical and semantic similarity above threshold. This is the closest analog to most prior knowledge-graph work, and serves as a sanity check that simply going below paper-level is not by itself the source of gains. \textbf{Encoding C (claims-as-nodes)} is the rrxiv encoding. Authors (or, for the 200-paper backfill, trained annotators reading on behalf of authors) registered an average of 7.2 claims per paper, each with a kind, an evidence type, and explicit \texttt{\textbackslash dependson}/\texttt{\textbackslash supports}/\texttt{\textbackslash contradicts} edges where the textual content supported them. Annotation followed a written guideline (median time per paper: 47 minutes, vs. 
14 minutes for paper-level metadata only --- the 3.4x ratio registered as Claim 2).

The three encodings were evaluated on three tasks. \emph{Task 1: retrieval.} A held-out set of 1{,}200 technical queries (each a single-sentence question about a narrow result, such as ``does dropout improve calibration for transformers under distribution shift?'') was run against each encoding via the same dense retriever, measuring recall@10 of the gold-labeled relevant paper-or-claim. \emph{Task 2: replication rollup.} For the 73 papers in the corpus with at least one replication attempt logged, we measured the disagreement between the paper-level replication label and the per-claim replication labels. \emph{Task 3: contradiction surfacing.} We measured how often a contradiction logged at the claim level (e.g., paper $p_2$'s Claim 3 contradicts paper $p_1$'s Claim 5) was surfaced by each encoding. Tasks 2 and 3 are not meaningful under Encoding A, which has no concept of per-claim status; we report them only for B and C.

\section{Results: registered claims}

\dependson{rrxiv:2605.00002:claim:c2}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c3}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c4}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c5}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c6}{rrxiv:2605.00002:claim:c1}
\dependson{rrxiv:2605.00002:claim:c7}{rrxiv:2605.00002:claim:c5}
% Cross-paper edges to the rrxiv whitepaper (rrxiv:2605.00001) were
% considered but the whitepaper's claim labels are not yet stable
% across versions, so we hold off on machine-readable depends_on/
% supports edges until label stabilisation. Prose engagement with
% the whitepaper's claim-graph commitments stands in for now.
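
Read as a directed graph, the six \texttt{\textbackslash dependson} declarations above have a simple shape. The sketch below assumes that \texttt{\textbackslash dependson\{a\}\{b\}} means ``$a$ depends on $b$'' (the argument order is an assumed reading, not something this section confirms) and abbreviates each claim ID to its local label $c_i$. The symbols $D$ and $D^{+}$ are introduced only for this illustration.
\[
D = \{(c_2,c_1),\,(c_3,c_1),\,(c_4,c_1),\,(c_5,c_1),\,(c_6,c_1),\,(c_7,c_5)\},
\qquad
D^{+} = D \cup \{(c_7,c_1)\}.
\]
Under that reading, $c_1$ is the only claim with no declared dependencies, and the transitive closure $D^{+}$ adds a single pair: $c_7$ reaches $c_1$ indirectly, through $c_5$. A challenge to Claim 1 would therefore, in principle, bear on every other registered claim.
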
\begin{claim}[Claim 1: subset relation] \label{claim:c1} Claim-level addressability is a strict superset of paper-level addressability: anything you can express by citing a paper, you can express by citing one of its claims. \emph{Replication status: untested.} \end{claim} The argument is structural, not empirical. A citation to paper $p$ is semantically equivalent to a citation to the unordered conjunction of $p$'s claims; the claim-level form additionally lets the citer pick out which claims they mean. The reverse direction does not hold: paper-level citation cannot express ``I rely on Result 3 but not on Result 7,'' which is exactly the move readers want when a paper contains a strong empirical claim alongside a weaker interpretive one. The strictness is therefore not aesthetic --- it corresponds to a real loss of information in the paper-level encoding. A subtle consequence: this is also the reason migration is cheap. An instance that publishes only paper-level metadata can be ingested by a claim-graph consumer as a degenerate case --- one synthetic claim per paper, labeled ``whole-paper assertion'' --- without breaking anything. The graph degrades gracefully; existing citation managers remain valid. We register this graceful-degradation property because it is a load-bearing argument against the ``but adoption is too hard'' objection. \begin{evidence}[Cost of registration] Annotation timings were collected over 18 annotators (PhDs in CS, biology, and economics), each annotating a stratified 50-paper subsample with 4-way overlap on a 20-paper calibration set. Median per-paper times were 47 minutes (claim-level, full edge graph), 22 minutes (claim-level, no inter-paper edges), and 14 minutes (paper-level metadata only). The 3.4x figure compares the first to the third. \end{evidence} \begin{claim}[Claim 2: annotation overhead] \label{claim:c2} Annotating claims is 3.4x more expensive than annotating papers (median, 18 annotators, 200-paper subset). 
\emph{Replication status: untested.}
\end{claim}

This is the central concession. The cost is real, it is not a one-time tax (each new version requires re-annotation of the diff), and it falls disproportionately on authors. We do not claim the cost is small. We claim it is justified because (a) it is paid once per paper-version, while query benefits accrue indefinitely; (b) most of the cost is in declaring edges, which an extractor-assisted tool can pre-populate; and (c) for the highest-value queries --- has this been replicated, does anyone contradict this --- there is no cheaper substitute that returns the right answer. The reproducibility-budget framework in \texttt{rrxiv:2605.00003} provides a complementary lens: if reproducibility is a budgetable cost, claim-level annotation is the first line item.

\begin{claim}[Claim 3: retrieval gain]
\label{claim:c3}
Claim-graph retrieval improves recall@10 by 28\% over citation-graph retrieval on narrow technical queries ($n = 1{,}200$ queries).
\emph{Replication status: untested.}
\end{claim}

Recall@10 rose from 0.51 (Encoding A) to 0.65 (Encoding C); Encoding B sat in between at 0.58. The gap between B and C is the relevant signal: simply going below the paper level (B) recovers about half the benefit, but the rhetorical bundling that authors do at the claim level (C) recovers the rest. Examining the error modes, Encoding B fails on queries where the answer requires a claim composed across two or three sentences (``does X improve under Y given Z?''), because the sentence-level decomposition fractures the proposition into pieces that each individually look low-relevance. Encoding C keeps the claim intact, which is what the query was actually asking about. We expect the gap to widen for queries posed by agents, which tend to issue narrower and more compositional questions than humans do; that hypothesis is not yet tested.
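
As a check on how the headline figure in Claim 3 relates to the reported recall values, the arithmetic can be written out. This sketch assumes the 28\% gain is measured relative to the Encoding A baseline, an interpretation the text does not state explicitly; the symbols $R_A$, $R_B$, and $R_C$ (recall@10 under each encoding) are introduced only for this illustration.
\[
\frac{R_C - R_A}{R_A} = \frac{0.65 - 0.51}{0.51} \approx 0.27,
\qquad
\frac{R_B - R_A}{R_C - R_A} = \frac{0.58 - 0.51}{0.65 - 0.51} = 0.5 .
\]
The first ratio is close to the 28\% in Claim 3, and the small gap is plausibly due to rounding in the two-decimal recall values; in absolute terms the gain is 14 points. The second ratio is the sense in which Encoding B recovers about half the benefit.
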
\begin{claim}[Claim 4: replication masking] \label{claim:c4} Paper-level replication labels mask within-paper disagreement: in our sample, 41\% of ``replicated'' papers had at least one contradicted claim. \emph{Replication status: replicated.} \end{claim} This is the only claim in this paper with replication status \emph{replicated}, and it carries the most weight for the argument. Of 73 papers in our corpus with a positive paper-level replication label, 30 contained at least one claim that a downstream paper had explicitly contradicted at the claim level. Without claim-level addressability, those contradictions are not surfaced --- they live inside the citing paper's prose, where a paper-level rollup cannot reach them. The paper-level label is not wrong; it is averaging over a population (the paper's claims) that has internal disagreement. This is the same kind of error as reporting a treatment as ``effective'' when only the primary endpoint was met and a secondary endpoint moved in the wrong direction. The replication of this claim itself was performed independently in \texttt{rrxiv:2605.00008}, which extends it to a larger active-replication corpus and reports a comparable 38\% figure. \begin{claim}[Claim 5: stable claim IDs] \label{claim:c5} A canonical claim ID format of \texttt{::