
Trying To Output The X Most Common Words In A Text File

I'm trying to write a program that reads in a text file and outputs a list of the most common words (30 as the code is written now) along with their counts. So something like: word1

Solution 1:

Use the collections.Counter class.

from collections import Counter

for word, count in Counter(words).most_common(30):
    print(word, count)
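
The snippet above assumes `words` already holds the words from the file. A minimal sketch of the complete flow might look like this; the helper name `top_words`, the lowercasing and the whitespace splitting are assumptions for illustration, not part of the original answer. It uses a sample string so it runs as-is; for a real file you would pass in the result of `open(path).read()`.

```python
from collections import Counter


def top_words(text, n=30):
    """Return the n most common words in text as (word, count) pairs."""
    # Assumption: words are whitespace-separated and compared case-insensitively.
    words = text.lower().split()
    return Counter(words).most_common(n)


# Sample text stands in for the contents of the file.
for word, count in top_words("the cat and the hat and the bat", 2):
    print(word, count)
# the 3
# and 2
```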

Some unsolicited advice: Don't make so many functions until everything is working as one big block of code. Refactor into functions after it works. You don't even need a main section for a script this small.

Solution 2:

Using itertools' groupby:

from itertools import groupby

words = sorted([w.lower() for w in open("/path/to/file").read().split()])
count = [[item[0], len(list(item[1]))] for item in groupby(words)]
count.sort(key=lambda x: x[1], reverse = True)
for item in count[:5]:
    print(*item)
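  • This reads the file's words, lowercases and sorts them, then groups equal neighbours to get each unique word and its number of occurrences. groupby only merges adjacent equal items, which is why the words are sorted first. The resulting list is then sorted by occurrence with: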

    count.sort(key=lambda x: x[1], reverse = True)
    
  • reverse = True lists the most common words first.

  • In the line:

    for item in count[:5]:
    

    [:5] sets how many of the most frequent words are shown.
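
The steps described above can be wrapped in a small function, sketched here under the assumption that the text is already in memory. The name `top_words_groupby` and the parameter `n` are hypothetical and not from the original answer.

```python
from itertools import groupby


def top_words_groupby(text, n=5):
    """Return the n most frequent words as [word, count] lists."""
    # groupby only merges adjacent equal items, so sort first.
    words = sorted(w.lower() for w in text.split())
    counts = [[word, len(list(group))] for word, group in groupby(words)]
    # list.sort is stable, so words with equal counts stay in alphabetical order.
    counts.sort(key=lambda item: item[1], reverse=True)
    return counts[:n]


print(top_words_groupby("b a b c a b", 2))
# [['b', 3], ['a', 2]]
```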

Solution 3:

The first method others have suggested, most_common(...), may not do what you need, because it returns the n most common words rather than the words whose count is less than or equal to n.

Here it is using most_common(...). Note that it prints only the n most common words:

>>> import re
>>> from collections import Counter
>>> def print_top(filename, max_count):
...     words = re.findall(r'\w+', open(filename).read().lower())
...     for word, count in Counter(words).most_common(max_count):
...         print(word, count)
...
>>> print_top('n.sh', 1)
force 1

The correct way would be as follows. Note that it prints all the words whose count is less than or equal to max_count:

>>> import re
>>> from collections import Counter
>>> def print_top(filename, max_count):
...     words = re.findall(r'\w+', open(filename).read().lower())
...     by_count = sorted(Counter(words).items(), key=lambda x: x[1], reverse=True)
...     for word, count in filter(lambda x: x[1] <= max_count, by_count):
...         print(word, count)
...
>>> print_top('n.sh', 1)
force 1
in 1
done 1
mysql 1
yes 1
egrep 1
for 1
1 1
print 1
bin 1
do 1
awk 1
reinstall 1
bash 1
mythtv 1
selections 1
install 1
v 1
y 1
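
As a rough sketch of the same threshold idea that returns the pairs instead of printing them: the function name `words_with_count_at_most` is hypothetical, and it takes the text directly rather than a filename so that it runs without the 'n.sh' file used above.

```python
import re
from collections import Counter


def words_with_count_at_most(text, max_count):
    """Return (word, count) pairs whose count is <= max_count, highest first."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    # most_common() with no argument returns every pair, sorted by count descending.
    return [(w, c) for w, c in counts.most_common() if c <= max_count]


print(words_with_count_at_most("a b b c c c d", 2))
# [('b', 2), ('a', 1), ('d', 1)]
```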

Solution 4:

Here is my Python 3 solution. I was asked this question in an interview and the interviewer was happy with this solution, although in a less time-constrained situation the other solutions above seem a lot nicer to me.

    dict_count = {}
    lines = []

    # Read the file, stripping the trailing newline from each line.
    with open("logdata.txt", "r") as file:
        for line in file:
            lines.append(line.replace('\n', ''))

    # Count how often each line occurs (every line is treated as one entry).
    for line in lines:
        if line not in dict_count:
            dict_count[line] = 1
        else:
            num = dict_count[line]
            dict_count[line] = (num + 1)

    def greatest(words):
        # Return [count, entry] for the entry with the highest count.
        greatest = 0
        string = ''
        for key, val in words.items():
            if val > greatest:
                greatest = val
                string = key
        return [greatest, string]

    most_common = []
    def n_most_common_words(n, words):
        # Repeatedly remove the current maximum; stop early if words runs out.
        while len(most_common) < n and words:
            top = greatest(words)
            most_common.append(top)
            del words[top[1]]

    n_most_common_words(20, dict_count)

    print(most_common)
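
The same repeated-maximum idea can also be sketched as a function that works on a copy of the counts, so the caller's dictionary is left intact. This is an illustrative variant, not the code from the interview: `n_most_common` is a hypothetical name, and it relies on the built-in `max` instead of a hand-written loop.

```python
def n_most_common(counts, n):
    """Return up to n [count, word] pairs, largest count first."""
    remaining = dict(counts)  # copy so the caller's dict is not modified
    result = []
    while remaining and len(result) < n:
        # Pick the key with the highest count among those still remaining.
        word = max(remaining, key=remaining.get)
        result.append([remaining.pop(word), word])
    return result


print(n_most_common({"a": 2, "b": 5, "c": 1}, 2))
# [[5, 'b'], [2, 'a']]
```

Each pass scans every remaining entry, so this is fine for a small n but slower than sorting once when n is close to the number of distinct words.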
