{"id":1807,"date":"2015-09-03T00:00:00","date_gmt":"2015-09-02T22:00:00","guid":{"rendered":"https:\/\/wwwneu.strehle.de\/tim\/weblog\/archives\/2015\/09\/03\/1569-2\/"},"modified":"2025-07-31T21:59:56","modified_gmt":"2025-07-31T19:59:56","slug":"1569-2","status":"publish","type":"post","link":"https:\/\/www.strehle.de\/tim\/weblog\/archives\/2015\/09\/03\/1569-2\/","title":{"rendered":"Counting word frequency using NLTK FreqDist()"},"content":{"rendered":"\n<p>A pretty simple programming task: Find the most-used words in a text and count how often they\u2019re used. (With the goal of later creating a pretty <a href=\"http:\/\/www.wordle.net\">Wordle<\/a>-like word cloud from this data.)<\/p>\n\n\n\n<p>I assumed there would be some existing tool or code, and <a href=\"https:\/\/twitter.com\/rogerhoward\/status\/632684621264629760\">Roger Howard said<\/a> NLTK\u2019s <a href=\"http:\/\/www.nltk.org\/api\/nltk.html#nltk.probability.FreqDist\">FreqDist()<\/a> was \u201ceasy as pie\u201d.<\/p>\n\n\n\n<p>So today I wrote the first Python program of my life, using <a href=\"http:\/\/www.nltk.org\">NLTK, the Natural Language Toolkit<\/a>. With the help of the <a href=\"http:\/\/www.nltk.org\/book\/ch01.html#frequency-distributions\">NLTK tutorial<\/a> and StackOverflow. I\u2019m sure it\u2019s terrible Python and bad use of NLTK. Sorry, I\u2019m a total newbie.<\/p>\n\n\n\n<p>In my example, the whole text is read into memory. I\u2019d like to process a lot of text (say, a 300 MB text file) \u2013 can NLTK do this? My 30 MB file took 40 seconds to process. 
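<\/p>\n\n\n\n<p>One idea I haven\u2019t tried yet for the large-file case (treat this as an untested sketch): read the input line by line and update the frequency distribution incrementally, so the whole text never sits in memory at once. Reusing the variables from the script below, something like this might do it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Untested sketch: stream the input line by line to keep memory usage low\nfdist = nltk.FreqDist()\nwith codecs.open(input_file, 'r', 'utf-8') as fp:\n    for line in fp:\n        for word in nltk.word_tokenize(line):\n            word = word.lower()\n            if len(word) &gt; 1 and not word.isnumeric() and word not in all_stopwords:\n                fdist&#91;word] += 1\n<\/code><\/pre>\n\n\n\n<p>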
Anyway, I\u2019m happy I got it working at all.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/env python\n# coding=UTF-8\n#\n# Output the 50 most-used words from a text file, using NLTK FreqDist()\n# (The text file must be in UTF-8 encoding.)\n#\n# Usage:\n#\n#   .\/freqdist_top_words.py input.txt\n#\n# Sample output:\n#\n# et;8\n# dolorem;5\n# est;4\n# aut;4\n# sint;4\n# dolor;4\n# laborum;3\n# ...\n#\n# Requires NLTK. Official installation docs: http:\/\/www.nltk.org\/install.html\n#\n# I installed it on my Debian box like this:\n#\n# sudo apt-get install python-pip\n# sudo pip install -U nltk\n# python\n# &gt;&gt;&gt; import nltk\n# &gt;&gt;&gt; nltk.download('stopwords')\n# &gt;&gt;&gt; nltk.download('punkt')\n# &gt;&gt;&gt; exit()\n\nimport sys\nimport codecs\nimport nltk\nfrom nltk.corpus import stopwords\n\n# NLTK's default German stopwords\ndefault_stopwords = set(nltk.corpus.stopwords.words('german'))\n\n# We're adding some on our own - could be done inline like this...\n# custom_stopwords = set((u'\u2013', u'dass', u'mehr'))\n# ... 
but let's read them from a file instead (one stopword per line, UTF-8)\nstopwords_file = '.\/stopwords.txt'\ncustom_stopwords = set(codecs.open(stopwords_file, 'r', 'utf-8').read().splitlines())\n\nall_stopwords = default_stopwords | custom_stopwords\n\ninput_file = sys.argv&#91;1]\n\nfp = codecs.open(input_file, 'r', 'utf-8')\n\nwords = nltk.word_tokenize(fp.read())\n\n# Remove single-character tokens (mostly punctuation)\nwords = &#91;word for word in words if len(word) &gt; 1]\n\n# Remove numbers\nwords = &#91;word for word in words if not word.isnumeric()]\n\n# Lowercase all words (default_stopwords are lowercase too)\nwords = &#91;word.lower() for word in words]\n\n# Stemming words seems to make matters worse, disabled\n# stemmer = nltk.stem.snowball.SnowballStemmer('german')\n# words = &#91;stemmer.stem(word) for word in words]\n\n# Remove stopwords\nwords = &#91;word for word in words if word not in all_stopwords]\n\n# Calculate frequency distribution\nfdist = nltk.FreqDist(words)\n\n# Output top 50 words\n\nfor word, frequency in fdist.most_common(50):\n    print(u'{};{}'.format(word, frequency))\n<\/code><\/pre>\n\n\n\n<p>If you want to teach me better Python, I\u2019m open to suggestions for improvement \ud83d\ude42<\/p>\n\n\n\n<p><em>Update:<\/em> You can also find that script <a href=\"https:\/\/github.com\/tistre\/nltk-examples\">on GitHub<\/a>. Many thanks to Roger Howard for <a href=\"https:\/\/github.com\/tistre\/nltk-examples\/pull\/2\">improving it<\/a>!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A pretty simple programming task: Find the most-used words in a text and count how often they\u2019re used. (With the goal of later creating a pretty Wordle-like word cloud from this data.) I assumed there would be some existing tool or code, and Roger Howard said NLTK\u2019s FreqDist() was \u201ceasy as pie\u201d. 
So today I [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_share_on_mastodon":"0"},"categories":[1],"tags":[],"class_list":["post-1807","post","type-post","status-publish","format-standard","hentry","category-weblog"],"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts\/1807","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/comments?post=1807"}],"version-history":[{"count":1,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts\/1807\/revisions"}],"predecessor-version":[{"id":1919,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts\/1807\/revisions\/1919"}],"wp:attachment":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/media?parent=1807"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/categories?post=1807"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/tags?post=1807"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}