Identical values can yield opposite conclusions when the design changes from one-sample to two-sample to paired. That’s not statistics being fickle… it’s statistics being literal. The tool answers exactly the question you ask. So build an audit trail: assignment method, who did which condition, counterbalancing, timestamps, preprocessing steps. Guard against mislabeling with human-readable labels (subject IDs, condition names), not just “Group A/B.” Write a pre-analysis oath (the checks you refuse to skip) and enforce it before any p-value is read aloud. When conclusions hinge on a single relabel, process is your truth serum.
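
To see the flip concretely, here is a minimal sketch (assuming Python with SciPy; the numbers are illustrative, not from the original): the same twelve measurements, analyzed as two independent groups versus as paired before/after readings on the same subjects, give very different p-values.

```python
# Minimal sketch: the same numbers, analyzed under different designs.
# Illustrative data (not from the original): small, consistent per-subject
# improvements riding on large between-subject variability.
import numpy as np
from scipy import stats

before = np.array([10.1, 14.2, 12.3, 18.5, 16.0, 11.7])
after  = np.array([10.6, 14.8, 12.9, 19.0, 16.4, 12.3])

# Two-sample (independent) design: treats the twelve values as two unrelated
# groups, so between-subject spread swamps the small shift -> large p-value.
ind = stats.ttest_ind(before, after)

# Paired design: looks only at the per-subject differences, which are small
# but consistent -> small p-value.
rel = stats.ttest_rel(before, after)

# One-sample design: the paired test is the same as asking whether the mean
# per-subject difference is zero.
one = stats.ttest_1samp(after - before, popmean=0.0)

print(f"independent two-sample: t={ind.statistic:+.2f}, p={ind.pvalue:.3f}")
print(f"paired:                 t={rel.statistic:+.2f}, p={rel.pvalue:.5f}")
print(f"one-sample on diffs:    t={one.statistic:+.2f}, p={one.pvalue:.5f}")
```

Same twelve numbers; the design you declare flips the verdict. That is exactly why the audit trail and the pre-analysis checks have to be written down before anyone looks at a p-value.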