Python – Removing Stop Words

Stop words are very common English words, such as "the", "he", and "have", that carry little meaning on their own and can usually be dropped from a sentence without changing what it says. NLTK provides lists of these words in its stopwords corpus. First, download that corpus into the Python environment.

import nltk
nltk.download('stopwords')

This downloads the stopwords corpus, which contains the English stop word list along with lists for several other languages.
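If the script may run more than once, the download call can be skipped when the corpus is already present. The following is a minimal sketch, assuming NLTK is installed; the helper name `ensure_stopwords` is hypothetical, while `nltk.data.find` and `nltk.download` are existing NLTK functions.

```python
def ensure_stopwords():
    """Download the NLTK stopwords corpus only if it is not already available."""
    import nltk  # imported here so that defining the helper does not require NLTK

    try:
        # find() raises LookupError when the resource is not on NLTK's data path
        nltk.data.find('corpora/stopwords')
    except LookupError:
        nltk.download('stopwords')
```

Calling `ensure_stopwords()` once before the first `stopwords.words(...)` call should be enough.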

Validate stop words

from nltk.corpus import stopwords
# The English list on its own
stopwords.words('english')
# With no argument, words() joins the lists for every language;
# print a 60-word window of that combined list
print(stopwords.words()[620:680])

When we run the above program, we get the following output (the exact words at these positions depend on the installed corpus version) −

['your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
"she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at']

The corpus also includes stop word lists for other languages. To see which languages are available:

from nltk.corpus import stopwords
print(stopwords.fileids())

When we run the above program, we get the following output −

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish',
'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian',
'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian',
'spanish', 'swedish', 'turkish']

Example

We use the following example to demonstrate how to remove stop words from a word list.

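from nltk.corpus import stopwords

# Use a set so that each membership check is fast
en_stops = set(stopwords.words('english'))

all_words = ['There', 'is', 'a', 'tree', 'near', 'the', 'river']
for word in all_words:
    # Print only the words that are not in the stop word list
    if word not in en_stops:
        print(word)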

When we run the above program, we get the following output −

There
tree
near
river
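
Note that "There" survives in the output above because the English list is lowercase and the membership test is case-sensitive. One way to handle this is to lowercase each word before the lookup. The following is a minimal sketch; the function name `remove_stop_words` is hypothetical, and a small hand-written set stands in for the NLTK list so that the snippet runs without the corpus.

```python
def remove_stop_words(words, stop_words):
    """Return the words whose lowercase form is not in stop_words."""
    return [word for word in words if word.lower() not in stop_words]


# Tiny stand-in for the NLTK English list (an assumption for illustration only)
sample_stops = {'there', 'is', 'a', 'the'}

all_words = ['There', 'is', 'a', 'tree', 'near', 'the', 'river']
filtered = remove_stop_words(all_words, sample_stops)
print(filtered)  # ['tree', 'near', 'river']
```

With NLTK, the second argument would be `set(stopwords.words('english'))` instead of the sample set.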
