To construct the inventory, we first compiled a dictionary of species names by combining all names available in the Catalogue of Life (CoL), the Encyclopedia of Life (EoL) and the Global Biodiversity Information Facility (GBIF). The terms in this dictionary were then located in the text of English-language BHL documents (about 24 million pages) using a string matching method. We then learned vector representations of the identified terms using three approaches: count-based, prediction-based and compositional distributional semantic models (DSMs). These approaches compute vector representations for both single-word and multi-word terms. The cosine similarity between two vectors serves as an indicator of the semantic relatedness of the corresponding terms: the higher the cosine similarity, the more related the two terms are. Finally, for each term we selected the top-k terms by cosine similarity as the terms most semantically related to it.
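The final ranking step can be illustrated with a minimal sketch. It assumes the term vectors have already been learned by one of the DSMs and are held in a plain dictionary; the function names `cosine` and `top_k_related` and the three-dimensional toy vectors are hypothetical and are not taken from the inventory itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (an assumption of this sketch)
    return dot / (norm_u * norm_v)

def top_k_related(term, vectors, k=20):
    """Return the k terms most similar to `term`, with their scores."""
    query = vectors[term]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, invented for illustration only.
vectors = {
    "Quercus robur": [0.9, 0.1, 0.0],
    "Quercus petraea": [0.8, 0.2, 0.1],
    "Salmo salar": [0.0, 0.1, 0.9],
}

for name, score in top_k_related("Quercus robur", vectors, k=2):
    print(name, round(score, 3))
```

At the scale of the full vocabulary, an exhaustive pairwise comparison like this would likely be replaced by vectorised matrix operations or an approximate nearest-neighbour index; the sketch is meant only to show the ranking logic.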
The inventory contains 288,562 species names, each of which occurs at least five times in BHL documents. For each term in the inventory, the 20 most semantically similar terms are provided, together with their similarity scores. To support further digital biodiversity processing, each term is also linked to its URI, UUID and LSID as indexed by Global Names.
A search interface that uses the inventory as metadata for query expansion is available at http://nactem.ac.uk/BHLQueryExpansion/.
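As a rough illustration of how an inventory of this kind could drive query expansion, the sketch below ORs a query term with its highest-ranked related terms. It is not the implementation behind the interface above: the function `expand_query`, the in-memory dictionary layout, the `OR` query syntax and the example scores are all assumptions made for illustration.

```python
def expand_query(term, inventory, max_terms=5, min_score=0.0):
    """Build a boolean OR query from a term and its related terms.

    `inventory` maps a term to a list of (related_term, score) pairs,
    assumed to be sorted by descending score.
    """
    related = inventory.get(term, [])
    kept = [name for name, score in related if score >= min_score][:max_terms]
    phrases = [term] + kept
    # Quote each name so multi-word terms are searched as phrases.
    return " OR ".join('"{}"'.format(p) for p in phrases)

# Hypothetical inventory entry; the scores are made up.
inventory = {
    "Quercus robur": [("Quercus petraea", 0.98), ("Fagus sylvatica", 0.75)],
}

print(expand_query("Quercus robur", inventory, max_terms=1))
```

A term that is absent from the inventory is returned unexpanded, so the query still runs as an ordinary phrase search.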