Measurement of new intelligences frequently relies on empirical scoring methods such as consensus-based measurement (CBM). Because the correctness of response options is established empirically rather than on the basis of theory, it has been questioned whether these methods indicate the truly correct options. The present studies aim to systematically investigate CBM methods as well as an enlarged pool of supposed alternative methods (Consensus Analysis, HOMALS, and the Nominal Response Model, NRM) based on data for which the true scoring keys are known. For the systematic evaluation of the empirical scoring methods, two studies were conducted, one using simulated data, the other using real-world data. In the simulation studies, several characteristics of data with two and five response categories were manipulated to investigate the influence of the independent variables sample size, number of items, ability of respondents, and difficulty of items. The methods were evaluated with dependent variables indicating the relative distance between true and reconstructed scoring keys and the correlation between true abilities and abilities estimated by the respective method. The results indicate that ability and difficulty are the key influencing variables for the consensus-based scoring methods. With high-ability samples and low-difficulty items, these methods performed well. CBM and Consensus Analysis showed only minor differences. The NRM was largely independent of the manipulated variables, but it worked only when two response options were fixed on the basis of plausible assumptions. HOMALS did not provide satisfactory results in any of the realized data conditions. The real-world data study used responses to 18 mathematics TIMSS 2011 items (N = 15,992). Consensus-based scoring methods as well as the NRM with plausible fixation worked perfectly, whereas HOMALS and the NRM with random fixation did not recover the correct scoring key. Moreover, the dependence on ability was supported.
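The core consensus-based scoring idea evaluated above can be sketched as follows. This is a minimal illustration, not the procedure of the studies: the data-generating scheme, the sample size, the success probability, and the use of the modal response as the empirical key are all simplifying assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: 200 respondents answer 10 items,
# each with 5 response options; option 0 is the true key throughout.
n_persons, n_items, n_options = 200, 10, 5
true_key = np.zeros(n_items, dtype=int)

# A high-ability sample under an assumed response model: each
# respondent picks the correct option with probability 0.7,
# otherwise a random distractor (options 1-4).
responses = np.where(
    rng.random((n_persons, n_items)) < 0.7,
    true_key,
    rng.integers(1, n_options, size=(n_persons, n_items)),
)

# Consensus-based key reconstruction: the modal (most frequent)
# response per item is taken as the empirically correct option.
reconstructed_key = np.array([
    np.bincount(responses[:, j], minlength=n_options).argmax()
    for j in range(n_items)
])

# Proportion of items for which the empirical key matches the true key.
recovery = (reconstructed_key == true_key).mean()
print(recovery)
```

With a high-ability sample, as here, the modal response coincides with the true key and recovery is perfect; lowering the success probability toward chance level makes the modal response, and hence the reconstructed key, unreliable, which mirrors the reported dependence of consensus-based methods on ability and difficulty.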
It is concluded that empirical scoring methods can be recommended only for specific data conditions. The results support the need for more finely elaborated theories for the measurement of intelligences.