Python 3 XML Parsing
What is XML?
XML stands for eXtensible Markup Language, a subset of the Standard Generalized Markup Language (SGML). It is a markup language used to give electronic documents a structure.
XML is designed for transmitting and storing data.
XML is a set of rules for defining semantic markup that divides documents into components and identifies these components.
It is also a meta-markup language, meaning it defines a syntax for defining other domain-specific, semantic, and structured markup languages.
Python Parsing XML
Common XML programming interfaces include DOM and SAX. These two interfaces process XML files differently and are therefore used in different contexts.
Python's standard library offers three ways to parse XML: SAX, DOM, and ElementTree. This chapter covers the first two:
1. SAX (Simple API for XML)
The Python standard library includes a SAX parser. SAX uses an event-driven model: as the parser reads the XML file, it triggers events and calls user-defined callback functions to handle them.
2. DOM (Document Object Model)
The DOM approach parses the XML data into a tree held in memory; you then work with the XML by operating on that tree.
The XML example file movies.xml used in this chapter has the following content:
<collection shelf="New Arrivals">
    <movie title="Enemy Behind">
        <type>War, Thriller</type>
        <format>DVD</format>
        <year>2003</year>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Talk about a US-Japan war</description>
    </movie>
    <movie title="Transformers">
        <type>Anime, Science Fiction</type>
        <format>DVD</format>
        <year>1989</year>
        <rating>R</rating>
        <stars>8</stars>
        <description>A schientific fiction</description>
    </movie>
    <movie title="Trigun">
        <type>Anime, Action</type>
        <format>DVD</format>
        <episodes>4</episodes>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Vash the Stampede!</description>
    </movie>
    <movie title="Ishtar">
        <type>Comedy</type>
        <format>VHS</format>
        <rating>PG</rating>
        <stars>2</stars>
        <description>Viewable boredom</description>
    </movie>
</collection>
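ElementTree is named above as the third parsing approach but is not walked through in this chapter. As a rough illustration only, the sketch below reads a shortened, inlined copy of the file with xml.etree.ElementTree. The MOVIES_XML string and the "n/a" placeholder are assumptions made so the snippet runs on its own; they are not part of the original example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of movies.xml, inlined so the sketch is self-contained.
MOVIES_XML = """\
<collection shelf="New Arrivals">
    <movie title="Enemy Behind">
        <type>War, Thriller</type>
        <year>2003</year>
    </movie>
    <movie title="Trigun">
        <type>Anime, Action</type>
        <episodes>4</episodes>
    </movie>
</collection>
"""

root = ET.fromstring(MOVIES_XML)
shelf = root.get("shelf")

# findtext() returns the default when the child element is missing,
# which matters here because Trigun has no <year> element.
summary = [
    (movie.get("title"), movie.findtext("type"), movie.findtext("year", default="n/a"))
    for movie in root.iter("movie")
]

print(shelf)
for row in summary:
    print(row)
```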
Parsing XML with SAX in Python
SAX is an event-driven API.
Parsing XML documents with SAX involves two parts: a parser and an event handler.
The parser is responsible for reading XML documents and sending events, such as element start and end events, to event handlers.
The event handlers are responsible for responding to these events and processing the XML data.
SAX is a good fit in the following situations:
1. You need to process large files.
2. You need only part of a file's contents, or specific information from it.
3. You want to build your own object model.
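To make the second situation concrete, here is a minimal sketch of a handler that keeps only the movie titles and ignores everything else in the document. The TitleCollector class and the inlined SAMPLE data are hypothetical names introduced for illustration.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Keeps only the title attribute of each <movie> element."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        # All other elements and all text are simply ignored.
        if name == "movie":
            self.titles.append(attrs.get("title", ""))


# Inlined sample data (an assumption, so the sketch runs without a file).
SAMPLE = (
    b'<collection shelf="New Arrivals">'
    b'<movie title="Enemy Behind"><stars>10</stars></movie>'
    b'<movie title="Ishtar"><stars>2</stars></movie>'
    b"</collection>"
)

collector = TitleCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.titles)
```

Because no tree is built, memory use stays roughly constant regardless of how many movie elements the input contains.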
To process XML with SAX in Python, first import the parse function from xml.sax and the ContentHandler class from xml.sax.handler.
ContentHandler Class Method Introduction
characters(content) Method
When it is called:
The parser calls characters() whenever it reads character data, that is, text that is not part of a tag:
- text between the start of a line and the next tag;
- text between one tag and the next tag;
- text between a tag and the end of the line.
In each case content holds that text as a string. A tag here can be either a start tag or an end tag. The parser may deliver a run of text in a single call or split it across several calls.
startDocument() Method
Called when a document starts.
endDocument() Method
Called when the parser reaches the end of the document.
startElement(name, attrs) Method
Called when an XML start tag is encountered. name is the tag name, and attrs is a dictionary-like object holding the tag's attribute values.
endElement(name) Method
Called when an XML end tag is encountered.
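The order in which these callbacks fire is easiest to see by logging them. The following is a small sketch, not part of the original example; EventLogger is a hypothetical class, and the one-line document is inlined for convenience.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records every callback the parser makes, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def endDocument(self):
        self.events.append(("endDocument",))

    def startElement(self, name, attrs):
        # attrs is dictionary-like; copy it into a plain dict for logging.
        self.events.append(("startElement", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        self.events.append(("characters", content))


logger = EventLogger()
xml.sax.parseString(b'<movie title="Ishtar"><stars>2</stars></movie>', logger)
for event in logger.events:
    print(event)
```

The log starts with startDocument and ends with endDocument, with each element's start, text, and end events nested in between.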
make_parser Method
The following function creates and returns a new parser object:
xml.sax.make_parser([parser_list])
Parameter description:
- parser_list: optional; a list of parser module names to try. The first parser found is used.
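As a rough sketch of how the returned parser object is typically used: attach a handler, then call its parse() method on a file name or a file-like object. TagCounter and the in-memory document are assumptions made for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class TagCounter(ContentHandler):
    """Counts how many times each tag name starts."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


parser = xml.sax.make_parser()
# Namespace processing is switched off, as in the chapter's main example.
parser.setFeature(feature_namespaces, False)

counter = TagCounter()
parser.setContentHandler(counter)

# parse() accepts a file name or a file-like object; a BytesIO stands in
# for a real file here.
parser.parse(io.BytesIO(b"<collection><movie/><movie/><movie/></collection>"))
print(counter.counts)
```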
parse Method
The following function creates a SAX parser and uses it to parse an XML document:
xml.sax.parse(xmlfile, contenthandler[, errorhandler])
Parameter description:
- xmlfile: the name of the XML file to parse.
- contenthandler: must be a ContentHandler object.
- errorhandler: optional; if specified, it must be a SAX ErrorHandler object.
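The sketch below shows one possible way to combine xml.sax.parse with a custom error handler. StarsHandler, RecordingErrorHandler, and parse_text_as_file are hypothetical helpers; the temporary file exists only so the snippet can run without movies.xml.

```python
import os
import tempfile
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class StarsHandler(ContentHandler):
    """Collects the integer value of every <stars> element."""

    def __init__(self):
        super().__init__()
        self.in_stars = False
        self.buffer = []
        self.stars = []

    def startElement(self, name, attrs):
        if name == "stars":
            self.in_stars = True
            self.buffer = []

    def characters(self, content):
        if self.in_stars:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "stars":
            self.stars.append(int("".join(self.buffer)))
            self.in_stars = False


class RecordingErrorHandler(ErrorHandler):
    """Remembers fatal error messages, then re-raises to stop parsing."""

    def __init__(self):
        self.messages = []

    def fatalError(self, exception):
        self.messages.append(exception.getMessage())
        raise exception


def parse_text_as_file(xml_text, handler, error_handler):
    """Write xml_text to a temporary file and parse it by file name."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".xml", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(xml_text)
        path = tmp.name
    try:
        xml.sax.parse(path, handler, error_handler)
    finally:
        os.remove(path)


GOOD = (
    "<collection><movie><stars>10</stars></movie>"
    "<movie><stars>8</stars></movie></collection>"
)
stars_handler = StarsHandler()
error_log = RecordingErrorHandler()
parse_text_as_file(GOOD, stars_handler, error_log)
print(stars_handler.stars)

# A document with a missing </movie> tag triggers fatalError().
try:
    parse_text_as_file("<collection><movie></collection>", StarsHandler(), error_log)
except xml.sax.SAXParseException as exc:
    print("parse failed:", exc.getMessage())
```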
parseString Method
The parseString function creates an XML parser and uses it to parse an XML string:
xml.sax.parseString(xmlstring, contenthandler[, errorhandler])
Parameter description:
- xmlstring: the XML string to parse.
- contenthandler: must be a ContentHandler object.
- errorhandler: optional; if specified, it must be a SAX ErrorHandler object.
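For XML that is already in memory, parseString avoids the need for a file. A brief sketch follows; ShelfReader and read_shelf are hypothetical names, and the data is passed as bytes here.

```python
import xml.sax


class ShelfReader(xml.sax.ContentHandler):
    """Reads the shelf name and counts the movies on it."""

    def __init__(self):
        super().__init__()
        self.shelf = None
        self.movie_count = 0

    def startElement(self, name, attrs):
        if name == "collection":
            self.shelf = attrs.get("shelf")
        elif name == "movie":
            self.movie_count += 1


def read_shelf(xml_bytes):
    """Parse in-memory XML and return (shelf name, number of movies)."""
    handler = ShelfReader()
    xml.sax.parseString(xml_bytes, handler)
    return handler.shelf, handler.movie_count


DATA = (
    '<collection shelf="New Arrivals">'
    '<movie title="Ishtar"/><movie title="Trigun"/>'
    "</collection>"
)
result = read_shelf(DATA.encode("utf-8"))
print(result)
```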
Python XML Parsing Example
#!/usr/bin/python3
import xml.sax


class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        # Reset so that whitespace between elements is ignored
        self.CurrentData = ""

    # Called when character data is read
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content


if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Disable namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # Replace the default ContentHandler with our own
    Handler = MovieHandler()
    parser.setContentHandler(Handler)

    parser.parse("movies.xml")
The above code produces the following results:
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom
For complete SAX API documentation, see Python SAX APIs.
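The example handler stores only the most recent characters() call for each field and prints as it goes. Since a parser may split one run of text across several calls, a more defensive variant buffers the text and builds its own object model, here a list of dictionaries. This is a sketch under those assumptions; MovieCollector and the inlined SAMPLE are hypothetical.

```python
import xml.sax


class MovieCollector(xml.sax.ContentHandler):
    """Builds one dict per <movie>, buffering text until the end tag."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._current = None  # dict for the movie being read, if any
        self._text = []       # text chunks seen since the last tag

    def startElement(self, name, attrs):
        if name == "movie":
            self._current = {"title": attrs.get("title", "")}
        self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect them all.
        self._text.append(content)

    def endElement(self, name):
        if name == "movie":
            self.movies.append(self._current)
            self._current = None
        elif self._current is not None:
            self._current[name] = "".join(self._text).strip()
        self._text = []


SAMPLE = b"""<collection shelf="New Arrivals">
    <movie title="Trigun">
        <type>Anime, Action</type>
        <episodes>4</episodes>
    </movie>
    <movie title="Ishtar">
        <type>Comedy</type>
    </movie>
</collection>"""

collector = MovieCollector()
xml.sax.parseString(SAMPLE, collector)
for movie in collector.movies:
    print(movie)
```

Keying each dictionary by tag name also means optional elements such as episodes are captured without extra branches.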
Parsing XML with xml.dom
The Document Object Model (DOM) is a standard programming interface, recommended by the W3C, for processing Extensible Markup Language (XML) documents.
When parsing an XML document, a DOM parser reads the entire document at once and stores all its elements in a tree structure in memory. You can then use the various functions provided by the DOM to read or modify the document’s content and structure, and you can also write modified content to an XML file.
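The read, modify, and write cycle described here might look like the following with xml.dom.minidom from the standard library. This is only a sketch: the inlined document, the new shelf name "Classics", and the changed star rating are illustrative values, and a StringIO buffer stands in for an output file.

```python
import io
from xml.dom.minidom import parseString

doc = parseString(
    '<collection shelf="New Arrivals">'
    '<movie title="Ishtar"><stars>2</stars></movie>'
    "</collection>"
)
collection = doc.documentElement

# Modify an attribute on the root element.
collection.setAttribute("shelf", "Classics")

# Modify the text of an existing element.
movie = collection.getElementsByTagName("movie")[0]
stars = movie.getElementsByTagName("stars")[0]
stars.firstChild.data = "3"

# Add a new child element with its own text node.
fmt = doc.createElement("format")
fmt.appendChild(doc.createTextNode("VHS"))
movie.appendChild(fmt)

# Serialize the modified tree; any writable text file object would work
# in place of the StringIO buffer.
buffer = io.StringIO()
doc.writexml(buffer)
updated_xml = buffer.getvalue()
print(updated_xml)
```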
Use xml.dom.minidom in Python to parse an XML file. The following example shows this:
#!/usr/bin/python3
import xml.dom.minidom

# Open the XML document using the minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
    print("Root element: %s" % collection.getAttribute("shelf"))

# Get all movies in the collection
movies = collection.getElementsByTagName("movie")

# Print detailed information for each movie
for movie in movies:
    print("*****Movie*****")
    if movie.hasAttribute("title"):
        print("Title: %s" % movie.getAttribute("title"))

    # Each child element holds a single text node; read its data
    type_node = movie.getElementsByTagName("type")[0]
    print("Type: %s" % type_node.childNodes[0].data)
    format_node = movie.getElementsByTagName("format")[0]
    print("Format: %s" % format_node.childNodes[0].data)
    rating_node = movie.getElementsByTagName("rating")[0]
    print("Rating: %s" % rating_node.childNodes[0].data)
    description_node = movie.getElementsByTagName("description")[0]
    print("Description: %s" % description_node.childNodes[0].data)
The execution results of the above program are as follows:
Root element: New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom
For complete DOM API documentation, please see Python DOM APIs.
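The DOM example indexes [0] on every lookup, which raises an IndexError for a child element that a movie does not have (in movies.xml, for instance, Trigun has no year element). One possible way to guard against that is a small helper that returns a default instead. child_text is a hypothetical function, the sample data is inlined, and the helper assumes the child holds plain text only.

```python
from xml.dom.minidom import parseString


def child_text(element, tag, default=""):
    """Return the text of the first <tag> under element, or default."""
    nodes = element.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return default
    # Assumes the child contains only a text node.
    return nodes[0].firstChild.data.strip()


doc = parseString(
    '<collection shelf="New Arrivals">'
    '<movie title="Enemy Behind"><year>2003</year></movie>'
    '<movie title="Trigun"><episodes>4</episodes></movie>'
    "</collection>"
)

rows = [
    (movie.getAttribute("title"), child_text(movie, "year", "n/a"))
    for movie in doc.documentElement.getElementsByTagName("movie")
]
for row in rows:
    print(row)
```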