Python – Filtering Duplicate Words
Often, we need the distinct words in a text, with each word listed only once, so duplicate words have to be removed. This can be done by splitting the text into word tokens with nltk and then applying Python's built-in set type.
Not Preserving Order
In the following example, we first tokenize the sentence into words. Then we apply the set() function, which creates an unordered collection of unique elements. The result contains each token once, in arbitrary order rather than in the order of the sentence.
import nltk
word_data = "The Sky is blue, the ocean is blue, and the Rainbow has a blue color."
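# First, tokenize the words (needs NLTK's Punkt tokenizer data, installed with nltk.download)
nltk_tokens = nltk.word_tokenize(word_data)
# Apply set to drop duplicates; the original word order is lost
no_order = list(set(nltk_tokens))
print(no_order)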
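When we run the above program, we get output similar to the following. The order of a set is arbitrary, so it may differ between runs. Note that the punctuation tokens ',' and '.' are included, and that 'The' and 'the' count as different words:
['blue', 'Rainbow', 'is', 'Sky', 'color', 'ocean', 'and', 'a', '.', 'The', 'has', ',', 'the']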
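Because a set has no defined order, the list produced this way is not reproducible from run to run. If stable output matters, one option is to sort the unique tokens. The following is a minimal sketch rather than part of the original example: it hard-codes the token list that nltk.word_tokenize is expected to return for the sample sentence (an assumption, so the sketch runs without NLTK), and the names tokens and unique_sorted are illustrative.

```python
# Tokens as nltk.word_tokenize is expected to return them for the sample
# sentence; hard-coded here (an assumption) so the sketch runs without NLTK.
tokens = ['The', 'Sky', 'is', 'blue', ',', 'the', 'ocean', 'is', 'blue', ',',
          'and', 'the', 'Rainbow', 'has', 'a', 'blue', 'color', '.']

# set() drops the duplicates; sorted() then imposes a deterministic order.
unique_sorted = sorted(set(tokens))
print(unique_sorted)
# [',', '.', 'Rainbow', 'Sky', 'The', 'a', 'and', 'blue', 'color', 'has', 'is', 'ocean', 'the']
```

Strings are sorted by code point, so punctuation and capitalized words come before lowercase words.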
Preserving Order
To remove duplicates while preserving the order of the words in the sentence, we loop over the tokens and keep a set of the words already seen. Each word is appended to the result list only the first time it appears.
import nltk
word_data = "The Sky is blue, the ocean is blue, and the Rainbow has a blue color."
# First, tokenize the words
nltk_tokens = nltk.word_tokenize(word_data)
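# Set of tokens already seen; membership tests on a set are fast
ordered_tokens = set()
# Unique tokens in order of first appearance
result = []
for word in nltk_tokens:
    if word not in ordered_tokens:
        ordered_tokens.add(word)
        result.append(word)
print(result)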
When we run the above program, we get the following output:
['The', 'Sky', 'is', 'blue', ',', 'the', 'ocean', 'and', 'Rainbow', 'has', 'a', 'color', '.']
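The same order-preserving result can be written more compactly. As a sketch, assuming Python 3.7 or later (where dictionaries keep insertion order), dict.fromkeys() keeps the first occurrence of each token. The token list is again hard-coded as an assumption so the code runs without NLTK. The helper unique_ignore_case is hypothetical and not part of the original example; it shows one way to treat 'The' and 'the' as the same word.

```python
# Tokens as nltk.word_tokenize is expected to return them for the sample
# sentence; hard-coded here (an assumption) so the sketch runs without NLTK.
tokens = ['The', 'Sky', 'is', 'blue', ',', 'the', 'ocean', 'is', 'blue', ',',
          'and', 'the', 'Rainbow', 'has', 'a', 'blue', 'color', '.']

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so this keeps the first occurrence of each token.
result = list(dict.fromkeys(tokens))
print(result)


def unique_ignore_case(words):
    """Hypothetical helper: keep the first spelling of each word,
    comparing words case-insensitively."""
    seen = set()
    kept = []
    for word in words:
        key = word.lower()
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return kept


print(unique_ignore_case(tokens))
```

With the sample tokens, the first list matches the loop-based result above, and the second drops the lowercase 'the' because 'The' was seen first.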