Parsing XML files in Python

First, we parse an XML (Extensible Markup Language) file to obtain the raw latitude and longitude data. This process demonstrates how to encapsulate some of Python’s less functional language features to generate an iterable sequence.

We use the xml.etree module. Parsing a file returns an ElementTree object, and we traverse the parsed data with its findall() method.

The XML code for the target data to be processed is as follows:

<Placemark><Point>
<coordinates>-76.33029518659048,
37.54901619777347,0</coordinates>
</Point></Placemark>

The file contains multiple <Placemark> tags, each of which wraps a <Point> tag containing a <coordinates> tag. This is the typical format of a KML (Keyhole Markup Language) file containing geographic location information.

The method for parsing an XML file can be abstracted into two levels: the bottom-level method is responsible for locating various tags, attribute values, and document content, while the top-level method is responsible for extracting useful objects from the text and attribute values.

The underlying processing method is as follows:

import xml.etree.ElementTree as XML
from typing import Text, List, TextIO, Iterable

def row_iter_kml(file_obj: TextIO) -> Iterable[List[Text]]:
    # Map the prefixes used in the search path to their namespace URIs.
    ns_map = {
        "ns0": "http://www.opengis.net/kml/2.2",
        "ns1": "http://www.google.com/kml/ext/2.2"}
    path_to_points = ("./ns0:Document/ns0:Folder/ns0:Placemark/"
                      "ns0:Point/ns0:coordinates")
    doc = XML.parse(file_obj)
    # Yield one list of strings per <coordinates> tag.
    return (comma_split(Text(coordinates.text))
            for coordinates in
            doc.findall(path_to_points, ns_map))

This function takes a text file object, typically opened in a with statement, and returns a generator that yields one list of strings for each latitude and longitude pair in the file. The function contains a simple static dictionary, ns_map, which provides namespace mapping information for the XML tags being searched. This dictionary is passed to the ElementTree.findall() method, which searches the parsed document.
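The role of the namespace mapping can be seen in isolation. The following is a minimal sketch, assuming a small in-memory KML document (SAMPLE_KML is a made-up stand-in for a real file): the same path matches when the prefixes are resolved through ns_map and matches nothing when the tags are left unqualified.

```python
import io
import xml.etree.ElementTree as XML

# Hypothetical two-point KML document used only for illustration.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><Folder>
<Placemark><Point>
<coordinates>-76.33029518659048,37.54901619777347,0</coordinates>
</Point></Placemark>
<Placemark><Point>
<coordinates>-76.27383399999999,37.840832,0</coordinates>
</Point></Placemark>
</Folder></Document>
</kml>"""

doc = XML.parse(io.StringIO(SAMPLE_KML))
ns_map = {"ns0": "http://www.opengis.net/kml/2.2"}

# Prefixed path: each ns0: prefix is expanded to the KML namespace URI.
qualified = doc.findall(
    "./ns0:Document/ns0:Folder/ns0:Placemark/ns0:Point/ns0:coordinates",
    ns_map)

# Unprefixed path: looks for tags in no namespace, so nothing matches.
unqualified = doc.findall("./Document/Folder/Placemark/Point/coordinates")

print(len(qualified))    # 2
print(len(unqualified))  # 0
print(qualified[0].tag)  # {http://www.opengis.net/kml/2.2}coordinates
```

The last line shows why the mapping is needed: ElementTree stores each tag name together with its namespace URI, so a bare tag name does not match.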

The main body of the parsing is a generator expression that uses the doc.findall() method to locate a series of tags. The text of each tag is passed to the comma_split() function, which splits it into a list of its comma-separated values.
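To see the function run end to end, here is a hedged sketch that repeats the two definitions from this section so that it is self-contained, and feeds them an in-memory file. SAMPLE_KML and the use of io.StringIO are assumptions made for illustration; with a real file, the with statement would wrap an open() call instead.

```python
import io
import xml.etree.ElementTree as XML
from typing import Text, List, TextIO, Iterable

def comma_split(text: Text) -> List[Text]:
    return text.split(",")

def row_iter_kml(file_obj: TextIO) -> Iterable[List[Text]]:
    ns_map = {
        "ns0": "http://www.opengis.net/kml/2.2",
        "ns1": "http://www.google.com/kml/ext/2.2"}
    path_to_points = ("./ns0:Document/ns0:Folder/ns0:Placemark/"
                      "ns0:Point/ns0:coordinates")
    doc = XML.parse(file_obj)
    return (comma_split(Text(coordinates.text))
            for coordinates in
            doc.findall(path_to_points, ns_map))

# Hypothetical one-point KML document standing in for a file on disk.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><Folder>
<Placemark><Point>
<coordinates>-76.33029518659048,37.54901619777347,0</coordinates>
</Point></Placemark>
</Folder></Document>
</kml>"""

with io.StringIO(SAMPLE_KML) as source:
    row_gen = row_iter_kml(source)  # the document is parsed here
    rows = list(row_gen)            # the rows are split as they are consumed

print(rows)  # [['-76.33029518659048', '37.54901619777347', '0']]
```

Wrapping the result in list() is only for display. The returned generator can be passed directly to any function that accepts an iterable.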

The comma_split() function is the function version of the string split() method, implemented as follows:

def comma_split(text: Text) -> List[Text]:
    return text.split(",")

By wrapping object methods as prefix functions, that is, writing comma_split(text) in place of text.split(","), we keep a consistent functional style. Furthermore, by adding type hints, we explicitly indicate that the function converts text into a list of text values. Without the type hints, there would be two candidate implementations of split(): one for splitting bytes and one for splitting a string. In Python 3, the Text type is an alias for str.
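As a small illustration of both points, the sketch below passes the prefix function straight to map() and then shows the two built-in split() methods side by side. The raw sample strings are taken from the coordinate values in this section; the variable names are made up for the example.

```python
from typing import List, Text

def comma_split(text: Text) -> List[Text]:
    return text.split(",")

raw = ["-76.33029518659048,37.54901619777347,0",
       "-76.27383399999999,37.840832,0"]

# A prefix function can be handed to map() directly, with no lambda wrapper.
rows = list(map(comma_split, raw))
print(rows[1])  # ['-76.27383399999999', '37.840832', '0']

# The two split() methods that the type hint distinguishes between:
str_parts = "1,2,3".split(",")      # str.split   -> list of str
bytes_parts = b"1,2,3".split(b",")  # bytes.split -> list of bytes
print(str_parts)    # ['1', '2', '3']
print(bytes_parts)  # [b'1', b'2', b'3']
```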

The row_iter_kml() function returns an iterable sequence of rows, each containing three strings representing the longitude, latitude, and altitude of a point on the path. At this point, the data is not yet directly usable; we need to extract the longitude and latitude values and convert them to floating-point numbers.
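One possible shape for that conversion step is sketched below. The name lat_lon_pairs and the decision to drop the altitude and return (latitude, longitude) tuples are assumptions made for this example, not the functions developed later.

```python
from typing import Iterable, Iterator, List, Tuple

def lat_lon_pairs(
        rows: Iterable[List[str]]) -> Iterator[Tuple[float, float]]:
    """Hypothetical helper: reorder each row and convert it to floats."""
    for lon, lat, _alt in rows:
        # Rows arrive as (longitude, latitude, altitude) strings.
        yield float(lat), float(lon)

rows = [['-76.33029518659048', '37.54901619777347', '0'],
        ['-76.459503', '38.331501', '0']]

points = list(lat_lon_pairs(rows))
print(points[1])  # (38.331501, -76.459503)
```

Because the helper is itself a generator, it can consume the rows lazily, one point at a time.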

The underlying parsing method converts the raw data into an iterable sequence of rows (lists or tuples), allowing us to process data files in a relatively simple and consistent manner. Chapter 3 describes how to convert a CSV (comma-separated values) file into a sequence of tuples. Chapter 6 will discuss this topic in detail, introducing different parsing methods.

The parsing results of the above function are as follows:

[['-76.33029518659048', '37.54901619777347', '0'],
['-76.27383399999999', '37.840832', '0'],
['-76.459503', '38.331501', '0'],
...
['-76.47350299999999', '38.976334', '0']]

Each row is the text of one <ns0:coordinates> tag split on its commas, giving east-west longitude, north-south latitude, and altitude. We’ll create functions later to process these rows into usable data.
