Learning and Evaluation of Topics via Distributional Semantics
Abstract: Written language is a means of communication: it not only shapes our thoughts but also helps us convey information. As the amount of digital text available keeps growing, it becomes increasingly difficult to locate and keep track of specific information of interest. This observation has fuelled the search for sophisticated representations of written text and for methods for learning meaning. In particular, topic identification has grown in importance in recent years as an approach to summarise, organise and understand text. Underpinning modern topic identification methods is the framework of distributional semantics, which is based on the assumption that meaning is associated with use and, in particular, that meaning can be learned by examining the contexts in which words occur. Motivated by this, in this thesis we look at the broad field of topic identification in text, with topics learned via state-of-the-art distributional semantics models. As such, we provide new answers to the complex question of how meaning is used to derive abstract concepts like topics, and how non-expert humans evaluate such abstract concepts generated by artificial processes. In more detail, we address three key problems. First, we tackle the problem of evaluating the output of topic models (a particular kind of topic identification method) on large text corpora by leveraging non-expert annotators to assess the relevance of topics to a set of documents. Second, we develop a new method to assist in the interpretation of topics by providing additional context. In particular, our solution learns topics as collections of sentences extracted from a large corpus of unstructured documents. Finally, we identify and track the topic of text collected over time. In particular, we look at text-based dialogues, which often consist of short utterances covering a variety of topics.
Authors: A. Augustin
Date: 2020-03-03
Venue: University of Southampton Institutional Repository
#publications #research #machinelearning