{"id":10,"date":"2017-08-25T13:44:59","date_gmt":"2017-08-25T17:44:59","guid":{"rendered":"http:\/\/blogs.harvard.edu\/geeks\/?p=10"},"modified":"2017-08-25T13:44:59","modified_gmt":"2017-08-25T17:44:59","slug":"some-thoughts-about-topic-modelling","status":"publish","type":"post","link":"https:\/\/archive.blogs.harvard.edu\/geeks\/2017\/08\/25\/some-thoughts-about-topic-modelling\/","title":{"rendered":"Some thoughts about topic modelling"},"content":{"rendered":"<p>During<span style=\"font-family: 'times new roman', times, serif\"> this year's Google Summer of Code program (2017), I worked on a project about topic modelling. The goal of this project is to identify the topics of each news article so that the articles can be grouped for further research or analysis. I write this post to share some of my thoughts and briefly discuss the program I implemented. I have also created some slides for this discussion at <a href=\"http:\/\/prezi.com\/6wuou_phlgy9\/?utm_campaign=share&amp;utm_medium=copy&amp;rc=ex0share\">Prezi<\/a>.<\/span><\/p>\n<p><span style=\"font-family: 'times new roman', times, serif\">I would like to divide my implementation into four steps: <em>tokenizing data<\/em>, <em>removing stop-words<\/em>, <em>applying machine-learning algorithms<\/em> and <em>tuning parameters<\/em>.<\/span><\/p>\n<h2><span style=\"font-family: 'times new roman', times, serif\">Tokenizing data<\/span><\/h2>\n<p><span style=\"font-family: 'times new roman', times, serif\">The data I am using is fetched from the MediaCloud database; each record consists of two attributes: sentence and article_id. The sentence attribute contains a plain-text string of one sentence from the article, while article_id identifies the article to which the sentence belongs.<\/span><\/p>\n<p><span style=\"font-family: 'times new roman', times, serif\">In general, tokenization means breaking sentences up into words.
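A bare-bones sketch of this step (a simplified illustration only; the post does not show the project's actual tokenizer) is to lowercase the sentence and split it into alphabetic tokens with a regular expression:

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

tokens = tokenize("Researchers studied the data carefully.")
# → ['researchers', 'studied', 'the', 'data', 'carefully']
```

Real tokenizers also handle punctuation, contractions and numbers, but this captures the core idea.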
Simple as this may sound, there are always things we can do better; for example, how should we handle verbs in various tenses and voices (e.g., study, studying and studied) and nouns in singular\/plural forms (e.g., datum and data)? This is where we need to introduce lemmatization.<\/span><\/p>\n<p><span style=\"font-family: 'times new roman', times, serif\">I have used the lemmatizer from nltk to solve this issue. In particular, it uses WordNet, a lexical database of English that helps find conceptual relationships between words such as antonyms, hyponyms, hypernyms, and synonyms.<\/span><\/p>\n<p><span style=\"font-family: 'times new roman', times, serif\">After this step, all sentences should be divided into words with uniform forms.<\/span><\/p>\n<h2><span style=\"font-family: 'times new roman', times, serif\">Removing stop-words<\/span><\/h2>\n<p><span style=\"font-family: 'times new roman', times, serif\">Blindly passing all tokens into a machine-learning algorithm is both inefficient and inaccurate.
A useful preprocessing step should at least include stop-word removal.<\/span><\/p>\n<p><span style=\"font-family: 'times new roman', times, serif\">In this project, there are three kinds of stop-words to remove: meaningless words, meaningless topics and document-frequent words. Meaningless words are words such as &#8216;to&#8217;, &#8216;for&#8217; and other similar ones; they carry no real meaning, so including them only wastes computation power. Meaningless topics are analogous to meaningless words: they may have semantic meaning, but they give no useful information as topics (e.g., we do not want to label an article with the topic &#8216;Mr.&#8217;). Document-frequent words are words that occur frequently in all articles. We prefer to filter them out because they contradict the idea of a topic (i.e., what makes this article different from the others).<\/span><\/p>\n<p><span style=\"font-family: 'times new roman', times, serif\">Based on the definitions above, there are two ways to eliminate stop-words: TF-IDF and a stop-word list. TF-IDF stands for term frequency-inverse document frequency; it weights each word by its frequency in the current document, discounted by the number of documents in the whole set that contain it.
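As a toy illustration of this weighting (a sketch of the idea, not the project's code; this uses one common variant, raw count times log inverse document frequency):

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term: its count in the document times
    log(number of documents / number of documents containing the term)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: count * math.log(n / df[term])
             for term, count in Counter(doc).items()} for doc in docs]

docs = [["election", "vote", "the"],
        ["election", "poll", "the"],
        ["market", "stock", "the"]]
weights = tfidf(docs)
# "the" occurs in every document, so its weight is log(3/3) = 0;
# rarer, more distinctive words such as "vote" score higher.
```

Document-frequent words thus get weights near zero automatically, which is exactly why TF-IDF can stand in for an explicit stop-word filter.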
A stop-word list simply contains all the stop-words we want to remove. While such a list may take time and effort to build, it gives us more control. In this project, I used an existing stop-word list.<\/span><\/p>\n<h2><span style=\"font-family: 'times new roman', times, serif\">Applying machine-learning algorithms<\/span><\/h2>\n<p><span style=\"font-family: 'times new roman', times, serif\">Now that all remaining words are meaningful, it is time to feed them to the machine-learning algorithms. While there are numerous approaches, I used the <em>Latent Dirichlet Allocation<\/em> (LDA) model and the <em>Non-negative Matrix Factorization<\/em> (NMF) model. The LDA model posits that each document is a mixture of a small number of topics and that each word&#8217;s creation is attributable to one of the document&#8217;s topics. This assumption is akin to the standard bag-of-words model used in our project. The NMF model is less commonly used than the LDA model and has fewer parameters to tune. In theory, there are some differences between the assumptions of LDA and NMF, but in this project they generated very similar topics, which is good news suggesting the topics are correct. Here I focus on the LDA algorithm.<\/span><\/p>\n<h2><span style=\"font-family: 'times new roman', times, serif\">Tuning parameters<\/span><\/h2>\n<p>After the model is built, we can further improve its accuracy by finding the optimal value for each parameter. In particular, we can tune two parameters in our LDA model: alpha and the total number of topics.<\/p>\n<p>Alpha denotes the sparsity of topics among stories. The smaller the alpha, the sparser the topic distribution.
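A minimal sketch of fitting LDA with an explicit alpha, using scikit-learn (an assumption on my part; the post does not say which LDA implementation the project used, and `doc_topic_prior` is scikit-learn's name for alpha):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["election vote poll candidate",
        "stock market trade price",
        "election poll candidate debate",
        "market price stock earnings"]

counts = CountVectorizer().fit_transform(docs)  # bag-of-words count matrix

# doc_topic_prior is alpha: a small value pushes each story toward a
# sparse mixture dominated by a few topics.
lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=0.1,
                                random_state=0).fit(counts)
doc_topics = lda.transform(counts)  # one topic distribution per story
```

Each row of `doc_topics` sums to one; with a small alpha, most of that mass concentrates on one or two topics per story.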
This is a bit like the long-tail theory: most topics have almost zero significance, while a limited number of topics are shared by most stories.<\/p>\n<p>In our project, I focused on the second parameter, the total number of topics (topic_num for short).<\/p>\n<p>The simplest way to find the optimal topic_num would be brute force, where we start with topic_num equal to 1, then 2, 3 and so on. Although this guarantees a correct result, it simply takes too long and wastes computation power.<\/p>\n<p>An alternative is recursion, which is a bit like binary search: we start from extreme values and gradually shrink the range until we find the optimal value. Again, this requires relatively large computation power.<\/p>\n<p>The method I used instead solves a polynomial equation. After testing the model on data sets of various sizes, I found that the likelihood often first increases and then decreases as topic_num grows. Based on this observation, I further assumed a polynomial relationship between likelihood and topic_num.<\/p>\n<p>This allows me to build the polynomial equation from only a few pre-computed points and identify the maximum straight away.<\/p>\n<p>The results show that this method performs relatively well while significantly lowering the costs in time and computation power.<\/p>\n<p>This model also fits a growing data set well. For x new stories, instead of re-tuning the model, we can simply assume topic_num will also increase by x +\/- e, where e is a small value.<\/p>\n<h2>Future work<\/h2>\n<p><span style=\"font-family: 'times new roman', times, serif\">Although the project is finished, there are still some improvements to be made. Firstly, we could achieve better lemmatization by giving the lemmatizer the part of speech of each word in the sentence.
Secondly, we may further investigate the relationship between likelihood and topic_num to give a more accurate prediction. Thirdly, we could dig into the API of the LDA model and try to reduce the number of training iterations.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>During this year's Google Summer of Code program (2017), I worked on a project about topic modelling. The goal of this project is to identify the topics of each news article so that the articles can be grouped for further research or analysis. I write this post to share some of my thoughts &hellip; <a href=\"https:\/\/archive.blogs.harvard.edu\/geeks\/2017\/08\/25\/some-thoughts-about-topic-modelling\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Some thoughts about topic modelling<\/span><\/a><\/p>\n","protected":false},"author":8849,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-10","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/archive.blogs.harvard.edu\/geeks\/wp-json\/wp\/v2\/posts\/10","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/archive.blogs.harvard.edu\/geeks\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/archive.blogs.harvard.edu\/geeks\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/geeks\/wp-json\/wp\/v2\/users\/8849"}],"replies":[{"embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/geeks\/wp-json\/wp\/v2\/comments?post=10"}],"version-history":[{"count":5,"href":"https:\/\/archive.blogs.harvard.edu\/geeks\/wp-json\/wp\/v2\/posts\/10\/revisions"}],"predecessor-version":[{"id":18,"href":"https:\/\/archive.blogs.harvard.edu\/geeks\/wp-json\/wp\/v2\/posts\/10\/revisi
ons\/18"}],"wp:attachment":[{"href":"https:\/\/archive.blogs.harvard.edu\/geeks\/wp-json\/wp\/v2\/media?parent=10"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/geeks\/wp-json\/wp\/v2\/categories?post=10"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/geeks\/wp-json\/wp\/v2\/tags?post=10"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}