dataset
The Pomona MDP-R-50K Dataset is a collection of approximately 50,000 documents annotated at several levels. The annotations cover both textual and graphical features: linguistic features such as part-of-speech (POS) tagging, named entity recognition (NER), semantic role labelling (SRL), and argument role labelling; graphical elements such as boxes and arrows, tables, and numbered lists; images and videos; and hyperlinks, audio, and metadata. The annotations follow the Text Encoding Initiative (TEI) standards, so they are machine-readable and can be used for automatic annotation and text mining. The documents are drawn from the MDP (Publications from the Modern Democratic Party of Japan) database, a major source of elite political discourse in Japan.
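
Because the annotations follow TEI, they can in principle be read with any XML parser. The sketch below is illustrative only: the dataset's actual schema is not described here, so the sample fragment, the use of `<w pos="...">` for part-of-speech tags, the use of `<orgName>` for named entities, and the helper name `extract_annotations` are all assumptions rather than the dataset's documented layout.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; whether the dataset files declare it is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment, invented for illustration (not taken from the dataset).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>
        <s>
          <orgName><w pos="PROPN">MDP</w></orgName>
          <w pos="VERB">publishes</w>
          <w pos="NOUN">manifestos</w>
        </s>
      </p>
    </body>
  </text>
</TEI>"""


def extract_annotations(xml_text):
    """Return (tokens, entities) from a TEI string.

    tokens   -- list of (word, pos) pairs, in document order
    entities -- list of organisation names marked with <orgName>
    """
    root = ET.fromstring(xml_text)
    # Each <w> element carries one token; its pos attribute holds the tag.
    tokens = [(w.text, w.get("pos")) for w in root.iterfind(".//tei:w", TEI_NS)]
    # Join all nested text so multi-token entity names are kept whole.
    entities = [
        "".join(e.itertext()).strip()
        for e in root.iterfind(".//tei:orgName", TEI_NS)
    ]
    return tokens, entities


tokens, entities = extract_annotations(SAMPLE)
print(tokens)
print(entities)
```

The same pattern would extend to other annotation layers (for example semantic roles or table markup) once the element and attribute names used by the dataset are known.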