Published July 27, 2015 | Version v1
Thesis Open

Automatic Metadata Extraction - The High Energy Physics Use Case

Authors/Creators

  • 1. Ecole Polytechnique Lausanne

Contributors

  • 1. European Organization for Nuclear Research
  • 2. Ecole Polytechnique Lausanne

Description

Automatic metadata extraction (AME) from scientific papers has been described as one of the hardest problems in document engineering. Heterogeneous content, varying style, and unpredictable placement of article components render the problem inherently non-deterministic. Conditional random fields (CRF), a machine learning technique, can be used to classify document metadata amidst this uncertainty, annotating document contents with semantic labels. High energy physics (HEP) papers, such as those written at CERN, have unique content and structural characteristics, with scientific collaborations of thousands of authors altering article layouts dramatically. The distinctive qualities of these papers necessitate the creation of specialised datasets and model features. In this work we build an unprecedented training set of HEP papers and propose and evaluate a set of innovative features for CRF models. We build upon GROBID, a state-of-the-art AME tool that coordinates a hierarchy of CRF models in a full-document cascade. Through our extensions and our own robust experimentation pipeline, we cross-validate 66 experiment variations to find new improvements in feature engineering. We succeed in enhancing the two most crucial CRF models within the cascade, reducing error by up to 25% for key classifications.
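
As a rough illustration of the closing claim, the sketch below computes a relative error reduction. It assumes the 25% figure is relative to the baseline error rate, which the description does not state explicitly. The function name and the error rates are hypothetical and are not results from the thesis.

```python
def relative_error_reduction(baseline_error: float, improved_error: float) -> float:
    """Return the fraction of the baseline error removed by the improved model.

    Assumption: "reducing error by 25%" means the error rate shrinks by a
    quarter of its baseline value, not by 25 percentage points.
    """
    if baseline_error <= 0.0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - improved_error) / baseline_error


# Hypothetical error rates chosen only to illustrate the arithmetic.
reduction = relative_error_reduction(baseline_error=0.08, improved_error=0.06)
print(f"{reduction:.0%}")  # prints "25%"
```

Under this reading, a classifier whose error rate falls from 8% to 6% has improved by only two percentage points of accuracy, yet has removed a quarter of its remaining mistakes.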

Files

CERN-THESIS-2015-105.pdf (2.6 MB)
md5:abb375b41023733d9c102191376f7a65

Additional details

Identifiers

CDS
2039361
CDS Report Number
CERN-THESIS-2015-105

CERN

Department
GS
Programme
No program participation
Studies
Not applicable
