Trying To Output The X Most Common Words In A Text File
Solution 1:
Use the collections.Counter class.
from collections import Counter
# `words` is assumed to be a list of the words already read from the file
for word, count in Counter(words).most_common(30):
    print(word, count)
Some unsolicited advice: Don't make so many functions until everything is working as one big block of code. Refactor into functions after it works. You don't even need a main section for a script this small.
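The snippet above leaves `words` undefined. As a minimal sketch of one way to build it, assuming that "word" means a run of letters, digits or underscores and that case should be ignored (the helper name `top_words` and the sample sentence are made up for illustration):

```python
import re
from collections import Counter

def top_words(text, n):
    """Return the n most common lowercase words in text as (word, count) pairs."""
    # \w+ keeps runs of letters, digits and underscores; punctuation is dropped.
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(n)

sample = "the cat and the hat and the bat"
for word, count in top_words(sample, 2):
    print(word, count)
```

For a real file, the text could be read first with something like `with open(path) as f: text = f.read()` and then passed to `top_words(text, 30)`.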
Solution 2:
Using itertools' groupby:
from itertools import groupby
words = sorted([w.lower() for w in open("/path/to/file").read().split()])
count = [[item[0], len(list(item[1]))] for item in groupby(words)]
count.sort(key=lambda x: x[1], reverse=True)
for item in count[:5]:
    print(*item)
This reads the file's words, lowercases and sorts them, and then builds a list of the unique words with their number of occurrences. That list is then sorted by occurrence with:
count.sort(key=lambda x: x[1], reverse=True)
The reverse=True argument lists the most common words first. In the line:
for item in count[:5]:
the slice [:5] sets how many of the most frequent words are shown.
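The reason the words are sorted before grouping is that groupby only merges equal items that sit next to each other. A small sketch of that behaviour on an in-memory list (the helper name `count_with_groupby` and the sample data are illustrative, not part of the answer):

```python
from itertools import groupby

def count_with_groupby(words):
    """Count words by grouping equal neighbours in a sorted copy of the list."""
    counts = [(word, len(list(group))) for word, group in groupby(sorted(words))]
    # Sort by count, highest first; the sort is stable, so ties stay alphabetical.
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts

words = ["b", "a", "b", "c", "a", "b"]
print(count_with_groupby(words)[:2])

# Without sorting first, groupby only merges adjacent duplicates,
# so every word here ends up in its own group of size 1:
print([(w, len(list(g))) for w, g in groupby(words)])
```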
Solution 3:
The first method others have suggested, using most_common(...), does not do what you need, because it returns the n most common words, not the words whose count is less than or equal to n.
Here is the version using most_common(...); note that it prints only the n most common words:
>>> import re
>>> from collections import Counter
>>> def print_top(filename, max_count):
...     words = re.findall(r'\w+', open(filename).read().lower())
...     for word, count in Counter(words).most_common(max_count):
...         print(word, count)
...
>>> print_top('n.sh', 1)
force 1
The correct way would be as follows; note that it prints all the words whose count is less than or equal to max_count:
>>> import re
>>> from collections import Counter
>>> def print_top(filename, max_count):
...     words = re.findall(r'\w+', open(filename).read().lower())
...     for word, count in filter(lambda x: x[1] <= max_count, sorted(Counter(words).items(), key=lambda x: x[1], reverse=True)):
...         print(word, count)
...
>>> print_top('n.sh', 1)
force 1
in 1
done 1
mysql 1
yes 1
egrep 1
for 1
1 1
print 1
bin 1
do 1
awk 1
reinstall 1
bash 1
mythtv 1
selections 1
install 1
v 1
y 1
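To make the difference between the two readings concrete, here is a small sketch on an in-memory string instead of a file. The function names `top_n` and `at_most` and the sample text are assumptions made for this illustration:

```python
import re
from collections import Counter

def top_n(text, n):
    """The n most frequent words, as Counter.most_common returns them."""
    return Counter(re.findall(r"\w+", text.lower())).most_common(n)

def at_most(text, max_count):
    """Every word that occurs max_count times or fewer, most frequent first."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    kept = [(w, c) for w, c in counts.items() if c <= max_count]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

text = "yes yes yes no no maybe"
print(top_n(text, 1))    # the single most frequent word
print(at_most(text, 1))  # only the words that occur once
```

With this sample, `top_n(text, 1)` picks "yes", while `at_most(text, 1)` picks "maybe", so the two functions answer different questions.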
Solution 4:
Here is my Python 3 solution. I was asked this question in an interview and the interviewer was happy with this solution, although in a less time-constrained situation the other solutions provided above seem a lot nicer to me.
dict_count = {}
lines = []
file = open("logdata.txt", "r")
for line in file:  # open("logdata.txt", "r"):
    lines.append(line.replace('\n', ''))
# Each line of the file is treated as one word.
for line in lines:
    if line not in dict_count:
        dict_count[line] = 1
    else:
        num = dict_count[line]
        dict_count[line] = (num + 1)
def greatest(words):
    greatest = 0
    string = ''
    for key, val in words.items():
        if val > greatest:
            greatest = val
            string = key
    return [greatest, string]
most_common = []
def n_most_common_words(n, words):
    # Stop early if the file has fewer than n distinct words.
    while len(most_common) < n and words:
        most_common.append(greatest(words))
        del words[(greatest(words)[1])]
n_most_common_words(20, dict_count)
print(most_common)
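The same "take the maximum, remove it, repeat" idea can also be sketched as a single function that works on a copy, so it needs no global list and leaves the caller's dictionary intact. This is a variation for illustration, not the code from the interview; the name `n_most_common` and the sample counts are made up:

```python
def n_most_common(counts, n):
    """Pick the n highest counts by repeatedly removing the current maximum."""
    remaining = dict(counts)  # work on a copy so the caller's dict is untouched
    result = []
    while remaining and len(result) < n:
        word = max(remaining, key=remaining.get)
        result.append([remaining.pop(word), word])
    return result

counts = {"error": 3, "ok": 5, "warn": 1}
print(n_most_common(counts, 2))
```

Each pass scans all remaining entries, so this does roughly n times the number of distinct words in work, which is fine for small inputs.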