Python3 XML parsing

Python 3 XML Parsing


What is XML?

XML stands for Extensible Markup Language (eXtensible Markup Language), a subset of Standard Generalized Markup Language (SGML). It is a markup language used to structure electronic documents.
You can learn XML tutorials on this site.

XML is designed for transmitting and storing data.

XML is a set of rules for defining semantic markup that divides documents into components and identifies these components.

It is also a meta-markup language, meaning it defines a syntax for defining other domain-specific, semantic, and structured markup languages.


Python Parsing XML

Common XML programming interfaces include DOM and SAX. These two interfaces process XML files differently and are therefore used in different contexts.

Python has three methods for parsing XML: SAX, DOM, and ElementTree:

1. SAX (simple API for XML)

Python The standard library includes a SAX parser. SAX uses an event-driven model to parse XML. The XML file is processed by triggering events and calling user-defined callback functions.

2. DOM (Document Object Model)

Parses XML data into a tree in memory, and manipulates the XML by operating on the tree.

The XML example file movies.xml used in this chapter has the following content:

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
   <movie title="Trigun">
   <type>Anime, Action</type>
<format>DVD</format>
<episodes>4</episodes>
<rating>PG</rating>
<stars>10</stars>
<description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
<type>Comedy</type>
<format>VHS</format>
<rating>PG</rating>
<stars>2</stars>
<description>Viewable boredom</description>
</movie>
</collection>

Parsing XML with SAX in Python

SAX is an event-driven API.

Parsing XML documents with SAX involves two parts: a parser and an event handler.

The parser is responsible for reading XML documents and sending events, such as element start and end events, to event handlers.

The event handlers are responsible for responding to these events and processing the XML data.

  • 1. Processing large files;
  • 2. Requires only a portion of the file’s contents or specific information.
  • 3. Requires building your own object model.

To process XML in Python using SAX, you must first import the parse function from xml.sax and the ContentHandler from xml.sax.handler.

ContentHandler Class Method Introduction

characters(content) Method

When to Call:

From the beginning of the line, before encountering a tag, if there are characters, the value of content is the string.

From a tag, before encountering the next tag, if there are characters, the value of content is the string.

From a tag, before encountering the end of the line, if there are characters, the value of content is the string.

A tag can be either a start tag or an end tag.

startDocument() Method

Called when a document starts.

endDocument() Method

Called when the parser reaches the end of the document.

startElement(name, attrs) Method

Called when an XML start tag is encountered. name is the tag name, and attrs is a dictionary of attribute values for the tag.

endElement(name) Method

Called when an XML end tag is encountered.


make_parser Method

The following method creates and returns a new parser object.

xml.sax.make_parser([parser_list])

Parameter Description:

  • parser_list
    Optional parameter, parser list

parser Method

The following method creates a SAX parser and parses an XML document:

xml.sax.parse(xmlfile, contenthandler[, errorhandler])

Parameter Description:

  • xmlfile
    XML file name
  • contenthandler
    Must be a ContentHandler object
  • errorhandler
    If this parameter is specified, errorhandler must be a SAX ErrorHandler object.

parseString Method

The parseString method creates an XML parser and parses an XML string:

xml.sax.parseString(xmlstring, contenthandler[, errorhandler])

Parameter Description:

  • xmlstring
    xmlstring
  • contenthandler
    Must be a ContentHandler object.
  • errorhandler
    If this parameter is specified, errorhandler must be a SAX ErrorHandler object.

Python XML Parsing Example

#!/usr/bin/python3

import xml.sax

class MovieHandler(xml.sax.ContentHandler):
   def __init__(self):
      self.CurrentData = ""
      self.type = ""
      self.format = ""
      self.year = ""
      self.rating = ""
      self.stars = ""
      self.description = ""

   # Element starts calling
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "movie":
         print ("****Movie*****")
         title = attributes["title"]
         print ("Title:", title)

   # End of element call
   def endElement(self, tag):
      if self.CurrentData == "type":
         print ("Type:", self.type)
      elif self.CurrentData == "format":
         print ("Format:", self.format)
      elif self.CurrentData == "year":
         print ("Year:", self.year)
      elif self.CurrentData == "rating":
         print ("Rating:", self.rating)
      elif self.CurrentData == "stars":
         print ("Stars:", self.stars)
      elif self.CurrentData == "description":
         print ("Description:", self.description)
      self.CurrentData = ""

   # Called when reading characters
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content
      elif self.CurrentData == "format":
         self.format = content
      elif self.CurrentData == "year":
         self.year = content
      elif self.CurrentData == "rating":
         self.rating = content
      elif self.CurrentData == "stars":
         self.stars = content
      elif self.CurrentData == "description":
self.description = content

if ( __name__ == "__main__"):

# Create an XMLReader
parser = xml.sax.make_parser()
# Disable namespaces
parser.setFeature(xml.sax.handler.feature_namespaces, 0)

# Override the ContextHandler
Handler = MovieHandler()
parser.setContentHandler( Handler )

parser.parse("movies.xml")

The above code produces the following results:

*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom

For complete SAX API documentation, see Python SAX APIs


Parsing XML with xml.dom

Document Object Model The XML Model (DOM) is a standard programming interface for processing Extensible Markup Language (XML) recommended by the W3C.

When parsing an XML document, a DOM parser reads the entire document at once and stores all its elements in a tree structure in memory. You can then use the various functions provided by the DOM to read or modify the document’s content and structure, and you can also write modified content to an XML file.

Use xml.dom.minidom in Python to parse an XML file. The following example shows this:

#!/usr/bin/python3

from xml.dom.minidom import parse
import xml.dom.minidom

# Open the XML document using the minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
print ("Root element: %s" % collection.getAttribute("shelf"))

# Get all movies in the collection
movies = collection.getElementsByTagName("movie")

# Print detailed information for each movie
for movie in movies:
print ("****Movie********")
if movie.hasAttribute("title"):
print ("Title: %s" % movie.getAttribute("title"))

   type = movie.getElementsByTagName('type')[0]
   print ("Type: %s" % type.childNodes[0].data)
   format = movie.getElementsByTagName('format')[0]
   print ("Format: %s" % format.childNodes[0].data)
   rating = movie.getElementsByTagName('rating')[0]
   print ("Rating: %s" % rating.childNodes[0].data)
   description = movie.getElementsByTagName('description')[0]
   print ("Description: %s" % description.childNodes[0].data)

The execution results of the above program are as follows:

Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A brilliant fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom

For complete DOM API documentation, please see Python DOM APIs.

Leave a Reply

Your email address will not be published. Required fields are marked *