Published August 28, 2014 | Version v1
Technical note Open

Author Clustering on Large Bibliographies

  • 1. European Organization for Nuclear Research

Description

We design and analyze an algorithm for clustering large sets of authors in bibliographies. Instead of using a distance function for pairwise comparison, the algorithm transforms the data into a multidimensional metric space, an approach similar to locality-sensitive hashing. The task belongs to the field of record linkage. The algorithm was designed and run on the data of the CERN Document Server (CDS), which consists of more than 1.7 million metadata entries, and it is part of the digital asset management software Invenio. Meant as a prototype, the algorithm performs efficiently, clustering all authors on CDS in under 30 minutes. We discuss extensions that improve the recall rate, which still remains inferior to that of the currently used clustering approach.
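The general idea of bucketing records by position in a vector space, rather than comparing every pair, can be illustrated with a minimal sketch. This is not the algorithm from the note, whose embedding and hashing scheme are not given in this description; it is a generic random-hyperplane variant of locality-sensitive hashing. The bigram embedding, the constants `DIM` and `N_PLANES`, and all function names are assumptions made for illustration.

```python
import random
import zlib
from collections import defaultdict

DIM = 64       # dimension of the embedding space (assumed for illustration)
N_PLANES = 16  # number of random hyperplanes, i.e. signature bits (assumed)


def embed(name):
    """Map an author name to a character-bigram count vector of length DIM."""
    text = "".join(ch for ch in name.lower() if ch.isalnum())
    vec = [0.0] * DIM
    for i in range(len(text) - 1):
        # crc32 instead of hash() so the mapping is stable across runs
        vec[zlib.crc32(text[i:i + 2].encode("utf-8")) % DIM] += 1.0
    return vec


def make_planes(seed=0):
    """Draw N_PLANES random hyperplane normals from a seeded generator."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_PLANES)]


def signature(vec, planes):
    """One bit per hyperplane: the side of the plane on which the vector lies."""
    return tuple(
        sum(v * p for v, p in zip(vec, plane)) >= 0.0 for plane in planes
    )


def cluster(names, planes):
    """Group names whose vectors share a signature; no pairwise comparison."""
    buckets = defaultdict(list)
    for name in names:
        buckets[signature(embed(name), planes)].append(name)
    return list(buckets.values())
```

Calling `cluster(["Ellis, J.", "ellis j", "Smith, A."], make_planes())` places the two spellings of the first name in the same bucket, because they normalize to the same vector. Names that are merely similar share a bucket only with some probability, which is one reason recall can lag behind an approach based on pairwise comparison. Each name is processed once, so the cost grows linearly with the number of entries.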

Files

paper.pdf

Files (333.6 kB)

Name Size
paper.pdf 333.6 kB
md5:b0fb1ded02adbca49eb738ad31914acb

Additional details

Identifiers

CDS Reference
CERN-STUDENTS-Note-2014-128

CERN

Department
IT