Counting the frequency of words in a list and sorting by frequency

I am using Python 3.3.

I need to create two lists, one for the unique words and the other for the frequencies of the words.

I have to sort the unique word list based on the frequencies list so that the word with the highest frequency is first in the list.

I have the design in text but am unsure how to implement it in Python.

The methods I have found so far use either Counter or dictionaries, which we have not learned yet. I have already created the list from the file containing all the words, but I do not know how to find the frequency of each word in the list. I know I will need a loop to do this, but I cannot figure it out.

The basic design is:

original list = ["the", "car",....]
newlst = []
frequency = []
for word in the original list
    if word not in newlst:
        newlst.append(word)
        set frequency = 1
    else
        increase the frequency
sort newlst based on frequency list

The ideal way is to use a dictionary that maps a word to its count. But if you can't use that, you might want to use two lists: one storing the words, and the other storing the counts of those words. Note that the order of the words and counts matters here, because the two lists are matched up by index. Implementing this is harder and not very efficient.
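A minimal sketch of that word-to-count dictionary approach (my own example, with raw counts rather than the proportions used below):

word_list = ["the", "car", "the"]   # the list of words you already have

counts = {}
for word in word_list:
    if word in counts:
        counts[word] += 1   # seen before: bump its count
    else:
        counts[word] = 1    # first occurrence

print(counts)   # {'the': 2, 'car': 1}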

Using Counter would be the best way, but if you don't want to do that, you can implement it yourself this way.

# The list you already have
word_list = ['words', ..., 'other', 'words']
# Get a set of unique words from the list
word_set = set(word_list)
# create your frequency dictionary
freq = {}
# iterate through them, once per unique word.
for word in word_set:
    freq[word] = word_list.count(word) / float(len(word_list))

freq will end up with the frequency of each word in the list you already have.

The float() call converts one of the integers to a float so that the division produces a fractional result instead of being truncated; in Python 3, which you are using, / already performs true division, so the float() is not strictly needed.
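If you then need the words sorted with the most frequent first, as the question asks, a possible follow-up (my sketch, not part of the original answer) is to sort the keys of freq by their values:

freq = {'the': 0.5, 'red': 0.3, 'car': 0.2}   # example result of the loop above

# Sort the unique words so the highest-frequency word comes first.
sorted_words = sorted(freq, key=freq.get, reverse=True)
print(sorted_words)   # ['the', 'red', 'car']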

Edit:

If you can't use a dict or set, here is another less efficient way:

# The list you already have
word_list = ['words', ..., 'other', 'words']
unique_words = []
for word in word_list:
    if word not in unique_words:
        unique_words += [word]
word_frequencies = []
for word in unique_words:
    word_frequencies += [float(word_list.count(word)) / len(word_list)]
for i in range(len(unique_words)):
    print(unique_words[i] + ": " + str(word_frequencies[i]))

The indices of unique_words and word_frequencies will match.
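If you then need unique_words reordered so the most frequent word comes first, as the question asks, one possible sketch (my addition, using example data in place of the lists built above) is to sort the two parallel lists together:

unique_words = ['the', 'car', 'red']    # example parallel lists, as produced
word_frequencies = [0.5, 0.2, 0.3]      # by the loops above

# Pair each frequency with its word, sort by frequency (descending),
# then unpack back into two aligned lists.
pairs = sorted(zip(word_frequencies, unique_words), reverse=True)
word_frequencies = [f for f, w in pairs]
unique_words = [w for f, w in pairs]

print(unique_words)       # ['the', 'red', 'car']
print(word_frequencies)   # [0.5, 0.3, 0.2]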

words = open("test.txt", "r").read().split()   # read the words into a list
uniqWords = sorted(set(words))                 # remove duplicate words and sort
for word in uniqWords:
    print(words.count(word), word)

One way would be to make a list of lists, with each sub-list in the new list containing a word and a count:

list1 = []    # this is your original list of words
list2 = []    # this is a new list of [word, count] pairs


for word in list1:
    words_so_far = [pair[0] for pair in list2]
    if word in words_so_far:
        list2[words_so_far.index(word)][1] += 1
    else:
        list2.append([word, 1])

Or, using a try/except instead of the membership test:

for word in list1:
    try:
        words_so_far = [pair[0] for pair in list2]
        list2[words_so_far.index(word)][1] += 1
    except ValueError:
        list2.append([word, 1])

This would be less efficient than using a dictionary, but it uses more basic concepts.
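To then put the most frequent word first, as the question requires, you could sort the pair list by its count field; a possible sketch (my addition, with made-up example data):

list2 = [['the', 2], ['car', 1], ['red', 3]]   # example [word, count] pairs

# Sort in place by the count (second element), largest first.
list2.sort(key=lambda pair: pair[1], reverse=True)
print(list2)   # [['red', 3], ['the', 2], ['car', 1]]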

You can use

from collections import Counter

It is available from Python 2.7 onwards; read more information here.

1.

>>> c = Counter('abracadabra')
>>> c.most_common(3)
[('a', 5), ('r', 2), ('b', 2)]

Using a dict:

>>> d = {1: 'one', 2: 'one', 3: 'two'}
>>> c = Counter(d.values())
>>> c.most_common()
[('one', 2), ('two', 1)]

But you have to read the file first and convert it to a dict.

2. This is the example from the Python docs, using re and Counter:

# Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

Use this:

from collections import Counter
list1=['apple','egg','apple','banana','egg','apple']
counts = Counter(list1)
print(counts)
# Counter({'apple': 3, 'egg': 2, 'banana': 1})
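If you also need the words ordered by frequency, as the question asks, Counter's most_common() returns the pairs already sorted by count; for example:

from collections import Counter

list1 = ['apple', 'egg', 'apple', 'banana', 'egg', 'apple']
counts = Counter(list1)

# most_common() returns (word, count) pairs sorted by count, highest first.
print(counts.most_common())                         # [('apple', 3), ('egg', 2), ('banana', 1)]
print([word for word, c in counts.most_common()])   # ['apple', 'egg', 'banana']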

Yet another solution, using a different algorithm and without using collections:

def countWords(A):
    dic = {}
    for x in A:
        if x not in dic:        # Python 2.7: if not dic.has_key(x):
            dic[x] = A.count(x)
    return dic


dic = countWords(['apple','egg','apple','banana','egg','apple'])
sorted_items = sorted(dic.items())   # if you want it sorted (this sorts by word)
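Since the question wants the highest-frequency word first, you would sort by the count instead; a possible sketch (my addition, with example data):

dic = {'apple': 3, 'egg': 2, 'banana': 1}   # example output of countWords()

# Sort the (word, count) pairs by count, largest first.
sorted_by_count = sorted(dic.items(), key=lambda item: item[1], reverse=True)
print(sorted_by_count)   # [('apple', 3), ('egg', 2), ('banana', 1)]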

You can use reduce() for a functional approach (in Python 3, reduce lives in functools):

words = "apple banana apple strawberry banana lemon"
reduce( lambda d, c: d.update([(c, d.get(c,0)+1)]) or d, words.split(), {})

returns:

{'strawberry': 1, 'lemon': 1, 'apple': 2, 'banana': 2}

Try this:

words = []
freqs = []


for line in sorted(original_list):    # takes all the lines in the original list and sorts them
    line = line.rstrip()              # strips them of trailing whitespace
    if line not in words:             # checks to see if line is already in words
        words.append(line)            # if not, it adds it to the end of words
        freqs.append(1)               # and adds 1 to the end of freqs
    else:
        index = words.index(line)     # if it is, it finds where it sits in words
        freqs[index] += 1             # and adds 1 to the matching index in freqs

The best thing to do is:

def wordListToFreqDict(wordlist):
    wordfreq = [wordlist.count(p) for p in wordlist]
    return dict(zip(wordlist, wordfreq))

Then call: wordListToFreqDict(originallist)

Here is code to support your question: is_word() checks that a string is a valid word, and only those strings are counted. A hashmap in Python is a dictionary.

def is_word(word):
    cnt = 0
    for c in word:
        if 'a' <= c <= 'z' or 'A' <= c <= 'Z' or '0' <= c <= '9' or c == '$':
            cnt += 1
    if cnt == len(word):
        return True
    return False


def words_freq(s):
    d = {}
    for i in s.split():
        if is_word(i):
            if i in d:
                d[i] += 1
            else:
                d[i] = 1
    return d


print(words_freq('the the sky$ is blue not green'))

Pandas answer:

import pandas as pd
original_list = ["the", "car", "is", "red", "red", "red", "yes", "it", "is", "is", "is"]
pd.Series(original_list).value_counts()

If you wanted it in ascending order instead, it is as simple as:

pd.Series(original_list).value_counts().sort_values(ascending=True)
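If you specifically want the two lists from the question (unique words and their counts, most frequent first), you can pull them out of the resulting Series; for example (the variable names are mine):

import pandas as pd

original_list = ["the", "car", "is", "red", "red", "red", "yes", "it", "is", "is", "is"]
counts = pd.Series(original_list).value_counts()   # sorted by count, descending

unique_words = counts.index.tolist()   # words, most frequent first
frequencies = counts.tolist()          # matching counts

print(unique_words)   # e.g. ['is', 'red', 'the', 'car', 'yes', 'it'] (order of ties may vary)
print(frequencies)    # [4, 3, 1, 1, 1, 1]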
Using a plain dict and dict.get():

words_dict = {}
for word in original_list:
    words_dict[word] = words_dict.get(word, 0) + 1


sorted_dt = {key: value for key, value in sorted(words_dict.items(), key=lambda item: item[1], reverse=True)}


keys = list(sorted_dt.keys())
values = list(sorted_dt.values())
print(keys)
print(values)

Simple way

d = {}
l = ['Hi','Hello','Hey','Hello']
for a in l:
    d[a] = l.count(a)
print(d)
Output : {'Hi': 1, 'Hello': 2, 'Hey': 1}

If you need both the word and its frequency:

def counter_(input_list_):
    lu = []
    for v in input_list_:
        ele = (v, input_list_.count(v) / len(input_list_))   # remove /len(input_list_) if you want raw counts instead of proportions
        if ele not in lu:
            lu.append(ele)
    return lu


counter_(['a', 'n', 'f', 'a'])

output:

[('a', 0.5), ('n', 0.25), ('f', 0.25)]