Python – Text Processing
Generally speaking, munging means cleaning up messy data by transforming it. In our case, we will see how transforming text can produce useful changes in our data. At its simplest, munging is just a transformation applied to the text being processed.
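As a minimal sketch of what such a transformation can look like, the snippet below tidies a messy string by collapsing whitespace and lowercasing it. The function name clean is hypothetical and chosen only for illustration; it is one simple cleanup among many possible ones.

```python
import re

def clean(text):
    # Hypothetical helper: collapse runs of whitespace into one space
    # and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Lowercase so that "Hello" and "hello" compare equal.
    return text.lower()

print(clean("  Hello,   World!\n"))
```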
Example
In the following example, we shuffle the letters of each word in a sentence, keeping the first and last letters in place, to generate alternate spellings of the kind that human typos might produce. Such rearrangements can help us recognize common misspellings when counting them and map them back to the correct spelling.
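import random
import re

def replace(t):
    # Shuffle the inner letters; keep the first and last letters in place.
    inner_word = list(t.group(2))
    random.shuffle(inner_word)
    return t.group(1) + "".join(inner_word) + t.group(3)

text = "Hello, You should reach the finish line."
# \w matches a word character, so the pattern covers words of three or more letters.
print(re.sub(r"(\w)(\w+)(\w)", replace, text))
print(re.sub(r"(\w)(\w+)(\w)", replace, text))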
When we run the above program, we get output similar to the following. The shuffling is random, so the result differs from run to run:
Hlleo, You slouhd raech the fsiinh lnie.
Hlleo, You suolhd raceh the fniish line.
Here, you can see that each word is jumbled, except for its first and last letters. Three-letter words such as "You" and "the" stay unchanged because they have only one inner letter. By taking a statistical approach to spelling errors, we can identify commonly misspelled words and provide the correct spelling for them.
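One way to map jumbled words back to their correct spelling is sketched below. It is an illustration under stated assumptions, not a full spelling corrector: the names signature, VOCABULARY, LOOKUP, restore, and unjumble are hypothetical, and the vocabulary is a tiny hand-written list. Each word is reduced to a key made of its first letter, its sorted inner letters, and its last letter, so every jumbled variant of a word shares the key of the original.

```python
import re

def signature(word):
    # Key that ignores the order of the inner letters of a word.
    w = word.lower()
    if len(w) <= 3:
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]

# Hypothetical vocabulary of correctly spelled words (an assumption for this sketch).
VOCABULARY = ["hello", "you", "should", "reach", "the", "finish", "line"]
LOOKUP = {signature(w): w for w in VOCABULARY}

def restore(match):
    word = match.group(0)
    # Fall back to the word itself when its key is not in the vocabulary.
    fixed = LOOKUP.get(signature(word), word)
    # Preserve a leading capital letter.
    return fixed.capitalize() if word[0].isupper() else fixed

def unjumble(text):
    # Replace each run of word characters; punctuation and spaces are untouched.
    return re.sub(r"\w+", restore, text)

print(unjumble("Hlleo, You slouhd raech the fsiinh lnie."))
```

Note that different words can share a key (for example, "form" and "from"), so a realistic version would need word frequencies or context to choose between candidates.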