ProCoder Cafe

Python – Filtering Duplicate Words

When analyzing text, we often want each distinct word to appear only once, so duplicate words must be removed. This can be done by tokenizing the text with nltk and passing the tokens to Python's built-in set() function.

Not Preserving Order

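In the following example, we first tokenize the sentence into words. Then we apply the set() function, which creates an unordered collection of unique elements. The result is the unique tokens in no particular order. Note that punctuation marks are tokens too, and that the comparison is case-sensitive, so 'The' and 'the' are kept as separate words.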

import nltk
word_data = "The Sky is blue, the ocean is blue, and the Rainbow has a blue color."

# First, tokenize the words
nltk_tokens = nltk.word_tokenize(word_data)

# Apply set
no_order = list(set(nltk_tokens))

print(no_order)

When we run the above program, we get output like the following (the order of a set is arbitrary and may differ between runs):

['blue', 'Rainbow', 'is', 'Sky', 'color', 'ocean', 'and', 'a', ',', '.', 'The', 'has', 'the']
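If a reproducible result is needed, one option is to sort the unique tokens. The following is a minimal sketch, not part of the original example. To stay self-contained it uses a small regular expression as a stand-in tokenizer (an assumption; it is not nltk.word_tokenize and may split other text differently).

```python
import re

word_data = "The Sky is blue, the ocean is blue, and the Rainbow has a blue color."

# Stand-in tokenizer (assumption, not nltk): runs of word characters,
# or single punctuation marks, each become one token.
tokens = re.findall(r"\w+|[^\w\s]", word_data)

# set() removes duplicates; sorted() gives a deterministic order.
# Punctuation and uppercase letters sort before lowercase letters.
unique_sorted = sorted(set(tokens))

print(unique_sorted)
```

Sorting discards the original word order, so this only suits cases where a stable listing matters more than sentence order.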

Preserve Order

To remove duplicates while preserving the order in which the words first appear in the sentence, we iterate over the tokens and keep a set of the words already seen. Each word that is not yet in the set is added to it and appended to the result list.

import nltk
word_data = "The Sky is blue, the ocean is blue, and the Rainbow has a blue color."

# First, tokenize the words
nltk_tokens = nltk.word_tokenize(word_data)

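# Set of tokens already seen (used only for fast membership checks)
ordered_tokens = set()
# Unique tokens in order of first appearance
result = []
for word in nltk_tokens:
    if word not in ordered_tokens:
        ordered_tokens.add(word)
        result.append(word)

print(result)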

When we run the above program, we get the following output:

['The', 'Sky', 'is', 'blue', ',', 'the', 'ocean', 'and', 'Rainbow', 'has', 'a', 'color', '.']
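The same order-preserving result can probably be written more compactly with dict.fromkeys(), since dictionaries keep insertion order in Python 3.7 and later. This is a sketch of an alternative, not the method used above. It again uses a regular expression as a stand-in tokenizer (an assumption, in place of nltk.word_tokenize) so that it runs without extra downloads.

```python
import re

word_data = "The Sky is blue, the ocean is blue, and the Rainbow has a blue color."

# Stand-in tokenizer (assumption, not nltk): words and single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", word_data)

# dict.fromkeys() keeps only the first occurrence of each key,
# and dict preserves insertion order (Python 3.7+).
ordered_unique = list(dict.fromkeys(tokens))

print(ordered_unique)
```

The explicit loop with a set is easier to extend, for example to lowercase each word before checking it, while the dict.fromkeys() form is shorter when no extra processing is needed.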
