Filtering Duplicate Words in Python

Often we only need to analyze the unique words in a document, so duplicate words have to be removed from the text. This can be done by tokenizing the text with nltk and then using Python's built-in set type.

Not Preserving Order

In the following example, we first tokenize the sentence into words. Then we apply the set() function, which creates an unordered collection of unique elements. The result contains each word exactly once, in no particular order.

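import nltk

# word_tokenize needs NLTK's Punkt tokenizer data, installed once via nltk.download()
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."

# First, tokenize the sentence into words
nltk_tokens = nltk.word_tokenize(word_data)

# Applying set() removes duplicates but discards the original order
no_order = list(set(nltk_tokens))

print(no_order)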

When we run the above program, we get output like the following. Because a set is unordered, the order of the words can differ from run to run:

['blue', 'Rainbow', 'is', 'Sky', 'colour', 'ocean', 'also', 'a', '.', 'The', 'has', 'the']
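
If a reproducible listing is needed, the set can be sorted before printing. The following is a minimal sketch rather than part of the original example: it assumes plain str.split() as a stand-in for nltk.word_tokenize so that it runs without NLTK, which means the final period stays attached to the last word.

```python
# Sketch only: str.split() stands in for nltk.word_tokenize here,
# so punctuation is not split off ("colour." stays one token).
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."
tokens = word_data.split()

# set() drops the duplicates; sorted() then fixes the order.
# Uppercase letters sort before lowercase ones in Python's default ordering.
unique_sorted = sorted(set(tokens))

print(unique_sorted)
```

The sorted order is alphabetical, not the order of the sentence, so this only helps when the original word order does not matter.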

Preserve Order

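To remove duplicate words while preserving the order in which they first appear in the sentence, we read the words one by one, keep a set of the words already seen, and append a word to the result list only the first time it occurs. The comparison is case-sensitive, so 'The' and 'the' count as different words.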

import nltk

word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."

# First, tokenize the sentence into words
nltk_tokens = nltk.word_tokenize(word_data)

seen_tokens = set()  # words encountered so far (fast membership test)
result = []          # unique words in order of first appearance
for word in nltk_tokens:
    if word not in seen_tokens:
        seen_tokens.add(word)
        result.append(word)

print(result)

When we run the above program, we get the following output:

['The', 'Sky', 'is', 'blue', 'also', 'the', 'ocean', 'Rainbow', 'has', 'a', 'colour', '.']
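
The same first-occurrence ordering can also be obtained without an explicit loop. The sketch below is an alternative to the loop above, not a replacement for it: it relies on dictionaries keeping insertion order (guaranteed from Python 3.7) and, as an assumption made so it runs without NLTK, uses str.split() in place of nltk.word_tokenize, so the final period stays attached to the last word.

```python
# Sketch only: str.split() stands in for nltk.word_tokenize here.
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour."
tokens = word_data.split()

# dict.fromkeys() keeps one key per distinct token, in first-seen order
# (dicts preserve insertion order in Python 3.7+).
unique_in_order = list(dict.fromkeys(tokens))

print(unique_in_order)
```

With a real tokenizer, the same dict.fromkeys() call can be applied to the token list directly.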
