Although, using Counter
from collections
library as suggested is better approach, but counting objects frequencies in run-time is a better approach to save memory, and to avoid duplication and redundancy (I believe that will be an answer for new Python learner):
From comment in your code it seem like you wants to improve your code. And I think you are able to read file content in words (while usually I avoid using read()
function and use for line in file_descriptor:
kind of code).
As words
is a string, In for loop, for i in words:
the loop-variable i
is not a word but a char. You are iterating over chars in string instead of iterating over words in string words
. To understand this notice following code snipe:
>>> for i in "Hi, h r u?":
... print i
...
H
i
,
h
r
u
?
>>>
Because iterating over string char by chars instead word by words is not what you wanted, to iterate words by words you should split method/function from string class in Python.
str.split(str="", num=string.count(str))
method returns a list of all the words in the string, using str as the separator (splits on all whitespace if left unspecified), optionally limiting the number of splits to num.
Notice below code examples:
Split:
>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?']
loop with split:
>>> for i in "Hi, how are you?".split():
... print i
...
Hi,
how
are
you?
And it looks something like you needs. Except word Hi,
because split()
by default split by whitespaces so Hi,
are kept as a single string (and obviously) that you don’t want. To count frequency of words in the file.
One good solution can be that use regex, But first to keep answer simple I answering with replace()
method. The method str.replace(old, new[, max])
returns a copy of the string in which the occurrences of old have been replaced with new, optionally restricting the number of replacements to max.
Now check below code example for what I want to suggest:
>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?'] # it has , with Hi
>>> "Hi, how are you?".replace(',', ' ').split()
['Hi', 'how', 'are', 'you?'] # , replaced by space then split
loop:
>>> for word in "Hi, how are you?".replace(',', ' ').split():
... print word
...
Hi
how
are
you?
Now, how to count frequency:
One way is use Counter as @Michael suggested, but to use your approach in which you wants to start from empty dict. Do something like this code:
words = f.read()
wordfreq = {}
for word in .replace(', ',' ').split():
wordfreq[word] = wordfreq.setdefault(word, 0) + 1
# ^^ add 1 to 0 or old value from dict
What I am doing?: because initially wordfreq
is empty you can’t assign to wordfreq[word]
at first time(it will rise key exception). so I used setdefault dict method.
dict.setdefault(key, default=None)
is similar to get()
, but will set dict[key]=default
if key is not already in dict. So for first time when a new word comes I set it with 0
in dict using setdefault
then add 1
and assign to same dict.
I written an equivalent code using with open instead of single open
.
with open('~/Desktop/file') as f:
words = f.read()
wordfreq = {}
for word in words.replace(',', ' ').split():
wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print wordfreq
That runs like this:
$ cat file # file is
this is the textfile, and it is used to take words and count
$ python work.py # indented manually
{'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2,
'it': 1, 'to': 1, 'take': 1, 'words': 1,
'the': 1, 'textfile': 1}
Using re.split(pattern, string, maxsplit=0, flags=0)
Just change for loop: for i in re.split(r"[,\s]+", words):
, that should produce correct output.
Edit: better to find all alphanumeric character because you may have more than one punctuation symbols.
>>> re.findall(r'[\w]+', words) # manually indent output
['this', 'is', 'the', 'textfile', 'and',
'it', 'is', 'used', 'to', 'take', 'words', 'and', 'count']
use for loop as: for word in re.findall(r'[\w]+', words):
How would I write code without using read()
:
File is:
$ cat file
This is the text file, and it is used to take words and count. And multiple
Lines can be present in this file.
It is also possible that Same words repeated in with capital letters.
Code is:
$ cat work.py
import re
wordfreq = {}
with open('file') as f:
for line in f:
for word in re.findall(r'[\w]+', line.lower()):
wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print wordfreq
Used lower()
to convert upper letter to lower.
output:
$python work.py # manually strip output
{'and': 3, 'letters': 1, 'text': 1, 'is': 3,
'it': 2, 'file': 2, 'in': 2, 'also': 1, 'same': 1,
'to': 1, 'take': 1, 'capital': 1, 'be': 1, 'used': 1,
'multiple': 1, 'that': 1, 'possible': 1, 'repeated': 1,
'words': 2, 'with': 1, 'present': 1, 'count': 1, 'this': 2,
'lines': 1, 'can': 1, 'the': 1}