Simulation models represent soil organic carbon (SOC) dynamics in global carbon (C) cycle scenarios to support climate-change studies. It is imperative to increase confidence in long-term predictions of SOC dynamics by reducing the uncertainty in model estimates. To do this, we evaluated SOC simulated by an ensemble of 26 process‐based C models by comparing simulations to experimental data from seven long-term bare-fallow (vegetation-free) plots at six sites in Denmark (two sites), France, Russia, Sweden and the United Kingdom. The decay of SOC in these plots has been monitored for decades since the last inputs of plant material, providing the opportunity to test decomposition without the continuous input of new organic material. The models were run independently over multi-year simulation periods (from 28 to 80 years) in a blind test with no calibration (Bln) and with three calibration scenarios, each providing different levels of information and/or allowing different levels of model fitting: a) calibrating decomposition parameters separately at each experimental site (Spe); b) using a generic, knowledge-based parameterisation applicable in the Central European region (Gen); and c) using a combination of strategies a) and b) (Mix). With this methodology, we addressed uncertainties from different modelling approaches with or without spin-up initialisation of SOC. Changes in the multi-model median (MMM) of SOC were used as descriptors of the ensemble performance. On average across sites, Gen proved adequate in describing changes in SOC, with an MMM of 39.2 (±15.5) Mg C ha⁻¹ (mean ± standard deviation) compared to the observed mean of 36.0 (±19.7) Mg C ha⁻¹ (last observed year), indicating sufficiently reliable SOC estimates.
This is important because moving to Mix (37.5 ± 16.7 Mg C ha⁻¹) and Spe (36.8 ± 19.8 Mg C ha⁻¹) provided only marginal gains in accuracy, whereas these scenarios require modellers to apply progressively more knowledge and a greater calibration effort than Gen, thereby limiting the wider applicability of models.
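As an illustration of the ensemble descriptor used here, the following minimal sketch shows how a multi-model median could be computed per site and then summarised across sites as a mean and standard deviation. It assumes that the MMM is taken site by site across models and that the reported summary is the across-site mean and sample standard deviation; the function name, the model names and all numbers are hypothetical and are not data from this study.

```python
from statistics import mean, median, stdev


def multi_model_median(simulations):
    """Return the per-site median of SOC across all models.

    simulations: dict mapping a model name to a list of simulated SOC
    values (Mg C ha-1), one value per site, in the same site order.
    """
    n_sites = len(next(iter(simulations.values())))
    return [
        median(values[i] for values in simulations.values())
        for i in range(n_sites)
    ]


# Hypothetical SOC values for three models at three sites (illustrative only).
simulated = {
    "model_a": [30.0, 42.0, 55.0],
    "model_b": [28.0, 40.0, 61.0],
    "model_c": [35.0, 47.0, 52.0],
}

mmm = multi_model_median(simulated)  # per-site medians: [30.0, 42.0, 55.0]

# Across-site summary of the MMM (assumed: mean and sample standard deviation).
mmm_mean = mean(mmm)
mmm_sd = stdev(mmm)

print(f"MMM across sites: {mmm_mean:.1f} (±{mmm_sd:.1f}) Mg C ha-1")
```

The median is used rather than the mean across models because it is less sensitive to a single outlying model; the same summary applied to the observed SOC values gives the reference against which the MMM is compared.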