Automatic Metadata Extraction - The High Energy Physics Use Case

Boyd, Joseph

Published July 27, 2015 | Version v1

Thesis Open

Boyd, Joseph¹

1. Ecole Polytechnique Lausanne

Contributors

Supervisors:

1. European Organization for Nuclear Research
2. Ecole Polytechnique Lausanne

Automatic metadata extraction (AME) from scientific papers has been described as one of the hardest problems in document engineering. Heterogeneous content, varying style, and unpredictable placement of article components render the problem inherently nondeterministic. Conditional random fields (CRF), a machine learning technique, can be used to classify document metadata amidst this uncertainty, annotating document contents with semantic labels. High energy physics (HEP) papers, such as those written at CERN, have unique content and structural characteristics, with scientific collaborations of thousands of authors altering article layouts dramatically. The distinctive qualities of these papers necessitate the creation of specialised datasets and model features. In this work we build an unprecedented training set of HEP papers, and we propose and evaluate a set of innovative features for CRF models. We build upon GROBID, a state-of-the-art AME tool that coordinates a hierarchy of CRF models in a full-document cascade. Through our extensions and our own robust experimentation pipeline, we cross-validate 66 experiment variations to find new improvements in feature engineering. We succeed in enhancing the two most crucial CRF models within the cascade, reducing error by up to 25% for key classifications.

Files

Name	Size
CERN-THESIS-2015-105.pdf (md5:abb375b41023733d9c102191376f7a65)	2.6 MB

Additional details

CDS: 2039361
CDS Report Number: CERN-THESIS-2015-105

Department: GS
Programme: No program participation
Studies: Not applicable

	All versions	This version
Views	1,029	1,029
Downloads	1,986	1,986
Data volume	7.0 GB	7.0 GB
