To evaluate the performance of diagnostic prediction models for ovarian malignancy in all patients with an ovarian mass managed surgically or conservatively.

Multicentre cohort study.

36 oncology referral centres (tertiary centres with a specific gynaecological oncology unit) or other types of centre.

Consecutive adult patients presenting with an adnexal mass between January 2012 and March 2015 and managed by surgery or follow-up.

Overall and centre specific discrimination, calibration, and clinical utility of six prediction models for ovarian malignancy (risk of malignancy index (RMI), logistic regression model 2 (LR2), simple rules, simple rules risk model (SRRisk), assessment of different neoplasias in the adnexa (ADNEX) with or without CA125). ADNEX allows the risk of malignancy to be subdivided into risks of a borderline, stage I primary, stage II-IV primary, or secondary metastatic malignancy. The outcome was based on histology if patients underwent surgery, or on results of clinical and ultrasound follow-up at 12 (±2) months. Multiple imputation was used when the outcome based on follow-up was uncertain.

The primary analysis included 17 centres that met strict quality criteria for surgical and follow-up data (5717 of all 8519 patients). 812 patients (14%) had a mass that was already in follow-up at study recruitment; therefore, 4905 patients were included in the statistical analysis. The outcome was benign in 3441 (70%) patients and malignant in 978 (20%). Uncertain outcomes (486, 10%) were most often explained by limited follow-up information. The overall area under the receiver operating characteristic curve was highest for ADNEX with CA125 (0.94, 95% confidence interval 0.92 to 0.96), ADNEX without CA125 (0.94, 0.91 to 0.95), and SRRisk (0.94, 0.91 to 0.95), and lowest for RMI (0.89, 0.85 to 0.92). Calibration varied among centres for all models; however, the ADNEX models and SRRisk were the best calibrated.
Calibration of the estimated risks for the tumour subtypes was good for ADNEX irrespective of whether CA125 was included as a predictor. Overall clinical utility (net benefit) was highest for the ADNEX models and SRRisk, and lowest for RMI. For patients who received at least one follow-up scan (n=1958), the overall area under the receiver operating characteristic curve ranged from 0.76 (95% confidence interval 0.66 to 0.84) for RMI to 0.89 (0.81 to 0.94) for ADNEX with CA125.

Our study found that the ADNEX models and SRRisk are the best models to distinguish between benign and malignant masses in all patients presenting with an adnexal mass, including those managed conservatively.

ClinicalTrials.gov NCT01698632.
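
The three performance measures named above (discrimination, calibration, and clinical utility) can be illustrated with a minimal Python sketch. It is not the study's analysis code: the function names, the toy risks and outcomes, and the 10% risk threshold are all hypothetical, and the calibration summary shown (an observed to expected ratio) is a simple stand-in that may differ from the calibration assessment used in the study. Net benefit follows the usual decision curve definition, true positives per patient minus false positives per patient weighted by the odds of the threshold.

```python
from typing import Sequence


def auc(risks: Sequence[float], outcomes: Sequence[int]) -> float:
    """Discrimination: probability that a randomly chosen malignant case
    has a higher predicted risk than a randomly chosen benign case
    (ties count as one half)."""
    malignant = [r for r, y in zip(risks, outcomes) if y == 1]
    benign = [r for r, y in zip(risks, outcomes) if y == 0]
    if not malignant or not benign:
        raise ValueError("need at least one malignant and one benign case")
    wins = 0.0
    for m in malignant:
        for b in benign:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malignant) * len(benign))


def observed_expected_ratio(risks: Sequence[float], outcomes: Sequence[int]) -> float:
    """Crude calibration summary: observed malignancies divided by the sum
    of predicted risks. A value near 1 suggests risks are right on average."""
    return sum(outcomes) / sum(risks)


def net_benefit(risks: Sequence[float], outcomes: Sequence[int], threshold: float) -> float:
    """Clinical utility at a given risk threshold:
    TP/N - FP/N * threshold / (1 - threshold)."""
    n = len(outcomes)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


# Hypothetical toy data: predicted risk of malignancy and outcome (1 = malignant).
toy_risks = [0.05, 0.10, 0.20, 0.40, 0.70, 0.90]
toy_outcomes = [0, 0, 0, 1, 0, 1]

print("AUC:", auc(toy_risks, toy_outcomes))
print("O/E ratio:", observed_expected_ratio(toy_risks, toy_outcomes))
print("Net benefit at 10%:", net_benefit(toy_risks, toy_outcomes, 0.10))
```

In practice a model would be compared with the "treat all" and "treat none" strategies across a range of thresholds, and centre specific estimates would be pooled rather than computed on a single sample as here.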