Python stop word removal

Removing Stop Words in Python

Stop words are common words that carry little meaning on their own, such as the, he, and have. They can usually be removed from a sentence without changing what it says. NLTK collects these words in a corpus called stopwords. We will first download it into our Python environment.

import nltk
nltk.download('stopwords')

This downloads the stop word lists, including the one for English.

Validate stop words

from nltk.corpus import stopwords
# words() with no argument joins the lists for every language into one,
# so this slice lands on English entries only for some versions of the data.
print(stopwords.words()[620:680])

When we run the above program, we get output like the following. The exact words depend on the version of the stop word data installed.

['your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
"she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at']

NLTK provides stop word lists for other languages besides English. They can be listed as follows:

from nltk.corpus import stopwords
print(stopwords.fileids())

When we run the above program, we get output like the following. Newer versions of the data may list more languages.

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish',
'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian',
'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian',
'spanish', 'swedish', 'turkish']

Example

We use the following example to demonstrate how to remove stop words from a word list.

from nltk.corpus import stopwords
en_stops = set(stopwords.words('english'))

all_words = ['There', 'is', 'a', 'tree', 'near', 'the', 'river']

for word in all_words:
    # The comparison is case-sensitive, so 'There' is not matched.
    if word not in en_stops:
        print(word)

When we run the above program, we get the following output –

There
tree
near
river
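
Note that There is still in the output: the English stop word list is lowercase and the membership test is case-sensitive. The sketch below shows one way to handle this by lowercasing each word before the lookup. The function name remove_stop_words and the small sample_stops set are assumptions made for illustration, so that the snippet runs without NLTK; in practice the set would likely come from set(stopwords.words('english')).

```python
def remove_stop_words(words, stop_words):
    """Return the words whose lowercase form is not in stop_words."""
    return [word for word in words if word.lower() not in stop_words]


# Hypothetical stand-in for set(stopwords.words('english')), kept tiny so the
# sketch runs without NLTK. NLTK's English list also contains "there".
sample_stops = {"there", "is", "a", "the"}

all_words = ['There', 'is', 'a', 'tree', 'near', 'the', 'river']
filtered = remove_stop_words(all_words, sample_stops)
print(filtered)
```

The original spelling of each kept word is preserved; only the lookup uses the lowercase form.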
