A Brief History of Human Time
Ever since Plutarch wrote his Parallel Lives at the beginning of the second century AD, a work whose 23 biographies have survived for nearly two thousand years, recording famous individuals and their influence has been a recurrent field of study. Over the last few years, this task has been undertaken on a much larger scale: a growing number of databases document history and allow statistical analysis of socio-historical facts at a scale never reached before.
Our most recent paper develops a cross-verified database of 2.2 million notable individuals using several Wikipedia editions and Wikidata.
Our approach complements existing ones in several ways. First, we collect a massive amount of data, which allows extensive cross-verification. The database draws on multiple sources (various editions of Wikipedia and Wikidata) and on deduplication techniques. Combining Wikipedia and Wikidata brings 2.72% new birth dates, 8.16% new occupations and 17.16% new citizenships. We find very few errors in the part of the database that contains the most documented individuals. We also find non-trivial error rates (around 1%) in the bottom part of the notability distribution, due to sparse information and to classification errors or ambiguity. These errors require either manual correction before future use or an explicit statistical treatment in any analysis. The combination of Wikipedia and Wikidata corrects about 0.5% of errors. One therefore needs to trade off the size of the database against the precision of the data.
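To make the idea of cross-verification concrete, here is a minimal sketch in Python of how two sources might be combined field by field. It is not the pipeline used in the paper. The record layout, the field names (birth_year, occupation, citizenship), the identifiers P1 and P2, and the function merge_records are all hypothetical, and the sketch assumes that the records have already been deduplicated and share a common identifier. A value missing in one source is filled from the other, and a disagreement is flagged for review rather than resolved silently.

```python
# Illustrative sketch only: field names, ids and values are made up.
FIELDS = ("birth_year", "occupation", "citizenship")


def merge_records(wikipedia, wikidata):
    """Merge two {id: {field: value}} mappings.

    Returns the merged records, a per-field count of values that only
    the second source supplied, and a list of conflicting values.
    """
    merged = {}
    gained = {field: 0 for field in FIELDS}
    conflicts = []
    for pid in sorted(set(wikipedia) | set(wikidata)):
        a = wikipedia.get(pid, {})
        b = wikidata.get(pid, {})
        record = {}
        for field in FIELDS:
            va, vb = a.get(field), b.get(field)
            if va is not None and vb is not None and va != vb:
                # The sources disagree: keep no value and flag the case.
                conflicts.append((pid, field, va, vb))
                record[field] = None
            else:
                record[field] = va if va is not None else vb
                if va is None and vb is not None:
                    gained[field] += 1  # new information from source two
        merged[pid] = record
    return merged, gained, conflicts


# Toy input: two individuals described by two hypothetical sources.
wikipedia = {
    "P1": {"birth_year": 1700, "occupation": "writer"},
    "P2": {"birth_year": 1879, "citizenship": "Y"},
}
wikidata = {
    "P1": {"birth_year": 1700, "occupation": "writer", "citizenship": "X"},
    "P2": {"birth_year": 1880, "occupation": "physicist"},
}

merged, gained, conflicts = merge_records(wikipedia, wikidata)
print(gained)
print(conflicts)
```

In this toy input the second source adds one occupation and one citizenship, and the two sources disagree on one birth year. Dividing such counts by the number of records gives percentages of the kind quoted above, and the conflict list is the part that would call for manual correction or statistical treatment.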
Second, we adopt a social science approach: data collection is driven by specific social questions on gender, on economic and cultural development, and on the quantitative exploration of cultural trends, all of which we document in the paper. In particular, this approach lets us document the Anglo-Saxon bias naturally present in existing projects based on the English edition of Wikipedia.
This strategy produced a cross-checked database of 2.2 million unique individuals. We do not recommend going beyond this: we found errors in the extended database of 4.7 million individuals. We also take into account a large proportion of the individuals newly added from the non-English editions of Wikipedia who played a significant role in important periods of human history. There are more than 700,000 such individuals, almost a third of the database we have verified.