Python 3 – XML Processing
XML is a portable, open-source language that allows programmers to develop applications that can be read by other applications regardless of the operating system and/or development language.
What is XML?
Extensible Markup Language (XML) is a markup language similar to HTML or SGML. It is recommended by the World Wide Web Consortium and provided as an open standard.
XML is very useful for tracking small to moderate amounts of data without requiring a SQL-based backbone.
XML Parser Architecture and API
The Python standard library provides a minimal but useful set of interfaces for processing XML.
The two most basic and widely used APIs for processing XML data are the SAX and DOM interfaces.
- Simple API for XML (SAX) − Here, you register callbacks for events of interest and then let the parser proceed through the document. This is useful when your document is large or memory is limited, because the file is parsed as it is read from disk and the entire file is never stored in memory.
- Document Object Model (DOM) API − This is a World Wide Web Consortium recommendation in which the entire file is read into memory and stored in a hierarchical (tree-based) form that represents all the features of an XML document.
Because SAX only moves forward through a document, it cannot answer repeated lookups in a large file as quickly as DOM, which keeps the whole tree in memory. On the other hand, using DOM on many small files can be a real drain on your resources.
SAX is read-only, while DOM allows you to modify XML files. Since these two different APIs complement each other, there’s no reason you can’t use them both in large projects.
For all our XML code examples, let’s use a simple XML file, movies.xml, as input.
<collection shelf = "New Arrivals">
   <movie title = "Enemy Behind">
      <type>War, Thriller</type>
      <format>DVD</format>
      <year>2003</year>
      <rating>PG</rating>
      <stars>10</stars>
      <description>A conversation about the US-Japan War</description>
   </movie>
   <movie title = "Transformers">
      <type>Anime, Science Fiction</type>
      <format>DVD</format>
      <year>1989</year>
      <rating>R</rating>
      <stars>8</stars>
      <description>Science Fiction</description>
   </movie>
   <movie title = "Trigun">
      <type>Anime, Action</type>
      <format>DVD</format>
      <episodes>4</episodes>
      <rating>PG</rating>
      <stars>10</stars>
      <description>Vash the Stampede!</description>
   </movie>
   <movie title = "Ishtar">
      <type>Comedy</type>
      <format>VHS</format>
      <rating>PG</rating>
      <stars>2</stars>
      <description>Watchably boring</description>
   </movie>
</collection>
Parsing XML with the SAX API
SAX is a standard interface for event-driven XML parsing. Parsing XML with SAX typically requires creating your own content handler by subclassing xml.sax.ContentHandler.
Your ContentHandler handles the tags and attributes specific to your XML style. The ContentHandler object provides methods for handling various parsing events. The owning parser calls ContentHandler methods while parsing an XML file.
The methods startDocument and endDocument are called at the beginning and end of an XML file. The method characters(text) is passed the character data of the XML file as the argument text.
The ContentHandler object is called at the beginning and end of each element. If the parser is not in namespace mode, the methods startElement(tag, attributes) and endElement(tag) are called; otherwise, the corresponding methods startElementNS and endElementNS are called. Here, tag is the element tag, and attributes is an Attributes object.
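To make the order of these callbacks concrete, here is a minimal sketch that records every event the parser reports. The class name EventLogger and the inline sample document are illustrative assumptions, not part of movies.xml; the handler is fed through xml.sax.parseString, which is described later in this section.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records events in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, tag, attributes):
        # attributes behaves like a read-only mapping of names to values
        self.events.append(("start", tag, dict(attributes.items())))

    def endElement(self, tag):
        self.events.append(("end", tag))

    def characters(self, content):
        # Skip whitespace-only chunks to keep the log short
        if content.strip():
            self.events.append(("text", content))


logger = EventLogger()
xml.sax.parseString(b'<movie title="Ishtar"><format>VHS</format></movie>', logger)
for event in logger.events:
    print(event)
```

The log starts with startDocument, ends with endDocument, and every start event is matched by an end event for the same tag.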
Here are some other important methods to know about before proceeding –
make_parser Method
The following method creates a new parser object and returns it. The parser object created will be of the first parser type found.
xml.sax.make_parser( [parser_list] )
Here are the parameter details –
- parser_list – Optional list of module names to try before the default parsers. Each named module must provide a create_parser() function.
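As a rough sketch of how make_parser is typically used, the following creates a parser with the default parser list, attaches a handler, and parses an in-memory stream. The TagCounter class and the sample bytes are assumptions made for illustration.

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts opening tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, tag, attributes):
        self.count += 1


# No parser_list given, so the first parser from the default list is used
parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, False)

counter = TagCounter()
parser.setContentHandler(counter)

# The parser's parse() method accepts a file name or a file-like object
parser.parse(io.BytesIO(b"<collection><movie/><movie/></collection>"))
print(counter.count)
```

Creating the parser yourself, rather than calling xml.sax.parse directly, is what lets you set features such as namespace handling before parsing starts.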
parse Method
The following method creates a SAX parser and uses it to parse a document.
xml.sax.parse(xmlfile, contenthandler[, errorhandler])
Following are the parameter details –
- xmlfile – The name of the XML file to read, or an open file-like object.
- contenthandler – Must be a ContentHandler object.
- errorhandler – If specified, errorhandler must be a SAX ErrorHandler object.
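The sketch below calls xml.sax.parse in both ways, once with an open binary stream and once with a file name. TitleCollector, the sample document, and the temporary file name sample.xml are assumptions used only to keep the example self-contained.

```python
import io
import os
import tempfile
import xml.sax

SAMPLE = b'<collection><movie title="Trigun"/><movie title="Ishtar"/></collection>'


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects the title attribute of each movie."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, tag, attributes):
        if tag == "movie":
            self.titles.append(attributes.get("title", ""))


# 1. Parse from an open binary stream
from_stream = TitleCollector()
xml.sax.parse(io.BytesIO(SAMPLE), from_stream)

# 2. Parse from a file name
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as f:
        f.write(SAMPLE)
    from_file = TitleCollector()
    xml.sax.parse(path, from_file)

print(from_stream.titles)
print(from_file.titles)
```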
parseString Method
There is also a method to create a SAX parser and parse the specified XML string.
xml.sax.parseString(xmlstring, contenthandler[, errorhandler])
Below are the parameter details –
- xmlstring – The XML document itself, passed as a string or bytes object (not a file name).
- contenthandler – Must be a ContentHandler object.
- errorhandler – If specified, errorhandler must be a SAX ErrorHandler object.
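Here is one possible way to use parseString together with basic error handling. When no errorhandler is passed, a document that is not well-formed raises xml.sax.SAXParseException. The helper count_elements and the class ElementCounter are hypothetical names introduced for this sketch.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts elements."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, tag, attributes):
        self.count += 1


def count_elements(xml_bytes):
    """Return the number of elements, or None if the input is not well-formed."""
    handler = ElementCounter()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except xml.sax.SAXParseException as err:
        print("Parse error at line", err.getLineNumber(), "-", err.getMessage())
        return None
    return handler.count


print(count_elements(b"<a><b/></a>"))   # well-formed input
print(count_elements(b"<a><b></a>"))    # <b> is never closed
```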
Example
#!/usr/bin/python3
import xml.sax

class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

    # Called with the character data found inside an element
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content

if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Turn off namespace processing
    parser.setFeature(xml.sax.handler.feature_namespaces, False)
    # Override the default ContentHandler
    handler = MovieHandler()
    parser.setContentHandler(handler)
    parser.parse("movies.xml")
Output
This will produce the following output −
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: A conversation about the US-Japan War
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: Science Fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Watchably boring
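One caveat about the handler above: a SAX parser is allowed to deliver the text of a single element in several characters() calls, so storing only the latest chunk can lose data on longer text. The following sketch shows one way to buffer the chunks instead. BufferingHandler, its FIELDS set, and the inline sample are assumptions for illustration, and the handler assumes every field element appears inside a movie element.

```python
import xml.sax


class BufferingHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects child-element text for each <movie>."""

    FIELDS = {"type", "format", "year", "rating", "stars", "description"}

    def __init__(self):
        super().__init__()
        self.movies = []      # one dict per <movie>
        self._buffer = None   # list of text chunks, or None when not collecting

    def startElement(self, tag, attributes):
        if tag == "movie":
            self.movies.append({"title": attributes.get("title", "")})
        elif tag in self.FIELDS:
            self._buffer = []

    def characters(self, content):
        # May be called more than once per element, so append rather than assign
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag in self.FIELDS and self._buffer is not None:
            self.movies[-1][tag] = "".join(self._buffer).strip()
            self._buffer = None


handler = BufferingHandler()
xml.sax.parseString(
    b'<collection><movie title="Ishtar"><format>VHS</format>'
    b"<stars>2</stars></movie></collection>",
    handler,
)
print(handler.movies)
```

Collecting results in a list of dictionaries, rather than printing inside the callbacks, also makes the handler easier to reuse and test.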
For complete details on the SAX API, see the standard Python SAX API documentation.
Parsing XML with DOM APIs
The Document Object Model (“DOM”) is a cross-language API provided by the World Wide Web Consortium (W3C) for accessing and modifying XML documents.
DOM is useful for random access applications. SAX allows you to view only a portion of a document at a time. If you are viewing one SAX element, you cannot access another.
The simplest way to quickly load an XML document is to create a minidom object using the xml.dom.minidom module. The module provides a simple parse function that quickly builds a DOM tree from an XML file.
The example below calls the parse(file[, parser]) function of xml.dom.minidom to parse the XML file designated by file into a DOM tree object.
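When the XML is already in memory rather than in a file, xml.dom.minidom also provides parseString, which builds the same kind of tree from text. A small sketch, using an inline document that is only an assumption for illustration:

```python
from xml.dom.minidom import parseString

# Build a DOM tree directly from a string instead of a file
doc = parseString('<collection shelf="New Arrivals"><movie title="Ishtar"/></collection>')

root = doc.documentElement
print(root.tagName)
print(root.getAttribute("shelf"))
print(len(root.getElementsByTagName("movie")))
```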
Example
#!/usr/bin/python3
import xml.dom.minidom

# Open the XML document using the minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
    print("Root element: %s" % collection.getAttribute("shelf"))

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print the details of each movie
for movie in movies:
    print("*****Movie*****")
    if movie.hasAttribute("title"):
        print("Title: %s" % movie.getAttribute("title"))

    # These names avoid shadowing the built-ins type() and format()
    movie_type = movie.getElementsByTagName("type")[0]
    print("Type: %s" % movie_type.childNodes[0].data)
    movie_format = movie.getElementsByTagName("format")[0]
    print("Format: %s" % movie_format.childNodes[0].data)
    rating = movie.getElementsByTagName("rating")[0]
    print("Rating: %s" % rating.childNodes[0].data)
    description = movie.getElementsByTagName("description")[0]
    print("Description: %s" % description.childNodes[0].data)
Output
This will produce the following output −
Root element: New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: A conversation about the US-Japan War
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: Science Fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Watchably boring
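The overview noted that DOM, unlike SAX, lets you modify a document. The sketch below shows one way this might look with minidom: it changes the text of an existing element, appends a new element, updates an attribute, and serializes the result. The note element, its text, and the shelf value "Back Catalogue" are made-up values for illustration and do not appear in movies.xml.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<collection shelf="New Arrivals">'
    '<movie title="Ishtar"><format>VHS</format></movie>'
    "</collection>"
)
movie = doc.getElementsByTagName("movie")[0]

# Replace the text of an existing element
movie_format = movie.getElementsByTagName("format")[0]
movie_format.firstChild.data = "DVD"

# Append a new child element (hypothetical <note>, not in movies.xml)
note = doc.createElement("note")
note.appendChild(doc.createTextNode("re-released"))
movie.appendChild(note)

# Change an attribute on the root element
doc.documentElement.setAttribute("shelf", "Back Catalogue")

# Serialize the modified tree back to XML text
result = doc.documentElement.toxml()
print(result)
```

To save the changes, the serialized text can be written to a file with the usual file operations.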
For more information on the DOM API, see the standard Python DOM API documentation.