Python – Part-of-Speech Tagging
Python – Part-of-Speech Tagging
Tag is a fundamental feature of text processing, marking words into grammatical categories. We create tags for each word using word segmentation and the pos_tag function.
import nltk
text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
tagged_text=nltk.pos_tag(text)
print(tagged_text)
When running the above program, we get the following output −
[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'),
('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'),
('the', 'DT'), ('nest', 'JJS')]
Tag Description
The following program uses built-in values to describe the meaning of each tag.
import nltk
nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')
When we run the above program, we get the following output −
NN: noun, common, singular, or mass
common-carrier, cabbage, knuckle-duster, Casino, afghan, shed, thermostat
investment, slide, humor, falloff, slick, wind, hyena, override, subhumanity
machinist ...
IN: preposition or conjunction, subordinate
astride among, uppon, whether, out, inside, pro, despite, on, by, throughout
below, within, for, towards, near, behind, atop, around, if, like, until below
next, into, if, beside ...
DT: determiner
all, an, another, any, both, del, each, either, every, half, la, many much nary
neither no some such that the them these this those
Tag the Corpus
We can also tag corpus data and view the tagging results for each word in the corpus.
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg
sample = gutenberg.raw("blake-poems.txt")
tokenized = sent_tokenize(sample)
for i in tokenized[:2]:
words = nltk.word_tokenize(i)
tagged = nltk.pos_tag(words)
print(tagged)
When running the above program, we get the following output −
[([', 'JJ'), (Poems', 'NNP'), (by', 'IN'), (William', 'NNP'), (Blake', 'NNP'), (1789', 'CD'),
(]', 'NNP'), (SONGS', 'NNP'), (OF', 'NNP'), (INNOCENCE', 'NNP'), (AND', 'NNP'), (OF', 'NNP'),
(EXPERIENCE', 'NNP'), (and', 'CC'), (THE', 'NNP'), (BOOK', 'NNP'), (of', 'IN'),
(THEL', 'NNP'), (SONGS', 'NNP'), (OF', 'NNP'), (INNOCENCE', 'NNP'), (INTRODUCTION', 'NNP'),
(Piping', 'VBG'), (down', 'RP'), (the', 'DT'), (valleys', 'NN'), (wild', 'JJ'),
(,', ','), (Piping', 'NNP'), (songs', 'NNS'), (of', 'IN'), (pleasant', 'JJ'), (glee', 'NN'),
(,', ','), (On', 'IN'), (a', 'DT'), (cloud', 'NN'), (I', 'PRP'), (saw', 'VBD'),
(a', 'DT'), (child', 'NN'), (,', ','), (And', 'CC'), (he', 'PRP'), (laughing', 'VBG'),
(said', 'VBD'), (to', 'TO'), (me', 'PRP'), (:', ':'), (``', '``'), (Pipe', 'VB'),
(a', 'DT'), (song', 'NN'), (about', 'IN'), (a', 'DT'), (Lamb', 'NN'), (!', '.'), (u"''", "''")]