Python Digital Forensics: Investigating Embedded Metadata

In this chapter, we will take a detailed look at investigating embedded metadata using Python digital forensics techniques.

Introduction

Embedded metadata is information about data that is stored in the same file as the object it describes. In other words, it is information about a digital asset stored within the digital file itself. Because it lives inside the file, it travels with the file whenever the file is copied or moved, although it can be edited or stripped with suitable tools.

In digital forensics, it is rarely possible to recover every piece of information about a specific file. Embedded metadata, however, can provide information that is crucial to an investigation. For example, the metadata of a text document may record the author, the document's length, the date it was written, and even a brief summary. A digital image may include metadata such as the image's dimensions and the shutter speed used.

Artifacts Containing Metadata Attributes and Their Extraction

In this section, we will learn about various artifacts containing metadata attributes and the process of extracting them using Python.

Audio and Video

These are two very common artifacts that have embedded metadata. This metadata can be extracted for investigation.

You can use the following Python script to extract common metadata attributes from audio (MP3) and video (MP4) files.

Note that for this script, we need to install a third-party Python library called mutagen, which allows us to extract metadata from audio and video files. It can be installed with the help of the following command:

pip install mutagen

In this Python script, we need to import some useful libraries:

from __future__ import print_function

import argparse
import json
import mutagen

This command-line handler accepts a single argument, the path to an MP3 or MP4 file. We then use the mutagen.File() function to open the file, as shown below.

if __name__ == '__main__':
    parser = argparse.ArgumentParser('Python Metadata Extractor')
    parser.add_argument("AV_FILE", help="File to extract metadata from")
    args = parser.parse_args()
    # mutagen.File() detects the file type and returns a tagged file object
    av_file = mutagen.File(args.AV_FILE)
    file_ext = args.AV_FILE.rsplit('.', 1)[-1]

    # Dispatch to the handler that matches the file extension
    if file_ext.lower() == 'mp3':
        handle_id3(av_file)
    elif file_ext.lower() == 'mp4':
        handle_mp4(av_file)
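
The extension check above can also be written as a lookup table, which makes it easier to add formats later. The following is only a sketch: pick_handler and the two stub handlers are hypothetical names used for illustration, and the stubs return strings instead of reading real files.

```python
import os


def handle_id3_stub(path):
    # Hypothetical stand-in for the real MP3 handler
    return "id3:" + path


def handle_mp4_stub(path):
    # Hypothetical stand-in for the real MP4 handler
    return "mp4:" + path


# Map lowercase extensions (without the dot) to handler functions
HANDLERS = {"mp3": handle_id3_stub, "mp4": handle_mp4_stub}


def pick_handler(path):
    """Return the handler for path, or None for unsupported extensions."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return HANDLERS.get(ext)


if __name__ == "__main__":
    for name in ("song.MP3", "clip.mp4", "notes.txt"):
        handler = pick_handler(name)
        print(name, "->", handler(name) if handler else "unsupported")
```

Returning None for unknown extensions lets the caller decide whether to skip the file or report it, rather than silently doing nothing.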

Now, we need two handlers: one for extracting data from the MP3 file and one for extracting data from the MP4 file. We can define these handlers as follows.

def handle_id3(id3_file):
    # Map ID3 frame IDs to human-readable names
    id3_frames = {
        'TIT2': 'Title', 'TPE1': 'Artist', 'TALB': 'Album', 'TXXX': 'Custom',
        'TCON': 'Content Type', 'TDRL': 'Date released', 'COMM': 'Comments',
        'TDRC': 'Recording Date'}
    print("{:15} | {:15} | {:38} | {}".format(
        "Frame", "Description", "Text", "Value"))
    print("-" * 85)

    for frames in id3_file.tags.values():
        frame_name = id3_frames.get(frames.FrameID, frames.FrameID)
        desc = getattr(frames, 'desc', "N/A")
        text = getattr(frames, 'text', ["N/A"])[0]
        value = getattr(frames, 'value', "N/A")

        # Timestamp objects must be converted to strings before formatting
        if "date" in frame_name.lower():
            text = str(text)
        print("{:15} | {:15} | {:38} | {}".format(
            frame_name, desc, text, value))


def handle_mp4(mp4_file):
    # Several QuickTime tag names begin with the copyright symbol (U+00A9)
    cp_sym = u"\u00A9"
    # Tag keys are case-sensitive; mutagen exposes the artist tag as (c)ART
    qt_tag = {
        cp_sym + 'nam': 'Title', cp_sym + 'ART': 'Artist',
        cp_sym + 'alb': 'Album', cp_sym + 'gen': 'Genre',
        'cpil': 'Compilation', cp_sym + 'day': 'Creation Date',
        'cnID': 'Apple Store Content ID', 'atID': 'Album Title ID',
        'plID': 'Playlist ID', 'geID': 'Genre ID', 'pcst': 'Podcast',
        'purl': 'Podcast URL', 'egid': 'Episode Global ID',
        'cmID': 'Camera ID', 'sfID': 'Apple Store Country',
        'desc': 'Description', 'ldes': 'Long Description'}
    # apple_genres.json must map genre ID strings to genre names
    genre_ids = json.load(open('apple_genres.json'))

Now, still inside handle_mp4(), we need to iterate over the tags of this MP4 file, as shown below:

    print("{:22} | {}".format('Name', 'Value'))
    print("-" * 40)

    for name, value in mp4_file.tags.items():
        tag_name = qt_tag.get(name, name)

        # Tag values are usually lists; join them into one string
        if isinstance(value, list):
            value = "; ".join([str(x) for x in value])
        # Translate a numeric genre ID into its genre name
        if name == 'geID':
            value = "{}: {}".format(
                value, genre_ids[str(value)].replace("|", " - "))
        print("{:22} | {}".format(tag_name, value))

The above script prints the tags embedded in MP3 and MP4 files.
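
To see the tag loop in isolation, the sketch below runs the same logic over a plain dictionary instead of a real mutagen object. The sample tags, the one-entry genre table, and the describe_tags name are all hypothetical; they only imitate the shape of the data (values keyed by tag name).

```python
def describe_tags(tags, names, genre_ids):
    """Return (label, value) pairs for a dictionary of MP4-style tags."""
    rows = []
    for name, value in tags.items():
        label = names.get(name, name)  # fall back to the raw tag name
        if isinstance(value, list):
            value = "; ".join(str(x) for x in value)
        if name == "geID":
            # Append the genre name, turning "A|B" paths into "A - B"
            genre = genre_ids.get(str(value), "Unknown").replace("|", " - ")
            value = "{}: {}".format(value, genre)
        rows.append((label, value))
    return rows


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real file
    sample_tags = {"\u00a9nam": ["Demo Track"], "geID": [14], "xyz": ["raw"]}
    sample_names = {"\u00a9nam": "Title", "geID": "Genre ID"}
    sample_genres = {"14": "Music|Pop"}
    for label, value in describe_tags(sample_tags, sample_names, sample_genres):
        print("{:22} | {}".format(label, value))
```

Unlike the script above, this version falls back to "Unknown" when a genre ID is missing from the table, which avoids a KeyError on unfamiliar IDs.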

Images

Images may contain different kinds of metadata, depending on the file format. Many images, particularly photos taken with smartphones and GPS-enabled cameras, carry embedded GPS information. We can extract this GPS information with a third-party Python library, using the following script.

First, install Pillow, the maintained fork of the Python Imaging Library (PIL), as shown below.

pip install pillow

This will help us extract metadata from the image.

We can also write the GPS details embedded in the image to a KML file. To do this, we need to install a third-party Python library called simplekml, as shown below.

pip install simplekml

In this script, we first need to import the following libraries:

from __future__ import print_function
import argparse

from PIL import Image
from PIL.ExifTags import TAGS

import simplekml
import sys

Now, the command line handler will accept a positional argument, which essentially represents the file path of the photo.

parser = argparse.ArgumentParser('Metadata from images')
parser.add_argument('PICTURE_FILE', help = "Path to picture")
args = parser.parse_args()

Now, we need to specify the URL templates that will be populated with the coordinate information: gmaps and open_maps. We also need a function to convert the degree-minute-second (DMS) coordinate tuples provided by the PIL library to decimal degrees. This can be done as follows:

gmaps = "https://www.google.com/maps?q={},{}"
open_maps = "http://www.openstreetmap.org/?mlat={}&mlon={}"

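def process_coords(coord):
    coord_deg = 0

    # Degrees, minutes, and seconds are divided by 60**0, 60**1, and 60**2
    for count, values in enumerate(coord):
        # Older Pillow releases return (numerator, denominator) tuples;
        # newer ones return rational numbers that float() accepts directly
        if isinstance(values, tuple):
            values = float(values[0]) / values[1]
        coord_deg += float(values) / 60**count
    return coord_deg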
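
As a worked example of the conversion, decimal degrees are degrees + minutes/60 + seconds/3600, with southern and western hemispheres made negative. The sketch below uses plain numbers rather than EXIF values; dms_to_decimal is a hypothetical helper name, and the sample coordinates are arbitrary.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    # South and West are negative by convention
    if ref in ("S", "W"):
        decimal = -decimal
    return decimal


if __name__ == "__main__":
    print(round(dms_to_decimal(40, 26, 46, "N"), 6))
    print(round(dms_to_decimal(79, 58, 56, "W"), 6))
```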

Now, we will open the file as a PIL object using the Image.open() function and read its EXIF data.

img_file = Image.open(args.PICTURE_FILE)
exif_data = img_file._getexif()

if exif_data is None:
    print("No EXIF data found")
    sys.exit()

for name, value in exif_data.items():
    gps_tag = TAGS.get(name, name)
    # Compare strings by value (!=), not by identity (is not)
    if gps_tag != 'GPSInfo':
        continue

After finding the GPSInfo tag, we store the GPS reference and process the coordinates using the process_coords() function. This code runs inside the loop shown above.

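    # Inside the loop: value is the GPSInfo dictionary
    # Keys 1 and 2 hold the latitude reference and the latitude
    lat_ref = value[1] == u'N'
    lat = process_coords(value[2])

    if not lat_ref:
        lat = lat * -1

    # Keys 3 and 4 hold the longitude reference and the longitude
    lon_ref = value[3] == u'E'
    lon = process_coords(value[4])

    if not lon_ref:
        lon = lon * -1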

Now, create a Kml object from the simplekml library as follows:

    kml = simplekml.Kml()
    # KML expects coordinates in (longitude, latitude) order
    kml.newpoint(name=args.PICTURE_FILE, coords=[(lon, lat)])
    kml.save(args.PICTURE_FILE + ".kml")

We can now print the coordinates from the processed information, as shown below.

    print("GPS Coordinates: {}, {}".format(lat, lon))
    print("Google Maps URL: {}".format(gmaps.format(lat, lon)))
    print("OpenStreetMap URL: {}".format(open_maps.format(lat, lon)))
    print("KML File {} created".format(args.PICTURE_FILE + ".kml"))
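
To show what a KML point looks like, and why the coordinates are passed as (longitude, latitude), here is a minimal sketch that builds a comparable document with only the standard library. It is not what simplekml emits byte for byte; build_kml is a hypothetical helper, and the sample values are arbitrary.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(name, lat, lon):
    """Return a minimal KML document (as a string) with one placemark."""
    # Register KML as the default namespace so tags print without a prefix
    ET.register_namespace("", KML_NS)
    kml = ET.Element("{%s}kml" % KML_NS)
    placemark = ET.SubElement(kml, "{%s}Placemark" % KML_NS)
    ET.SubElement(placemark, "{%s}name" % KML_NS).text = name
    point = ET.SubElement(placemark, "{%s}Point" % KML_NS)
    # KML lists longitude first, then latitude
    coords = ET.SubElement(point, "{%s}coordinates" % KML_NS)
    coords.text = "{},{}".format(lon, lat)
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    print(build_kml("photo.jpg", 40.446111, -79.982222))
```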

PDF Files

PDF documents can contain a variety of content, including images, text, and tables. When we extract embedded metadata from a PDF document, we may obtain the resulting data in a format called the Extensible Metadata Platform (XMP). We can extract this metadata with the help of the following Python code.

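First, install a third-party Python library called PyPDF2 to read metadata stored in the XMP format. The code in this section uses the PdfFileReader class and the getXmpMetadata() method, which belong to PyPDF2 releases before 3.0; later releases rename them to PdfReader and xmp_metadata. The library can be installed as follows: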

pip install PyPDF2

Now, import the following libraries to extract metadata from the PDF file:

from __future__ import print_function
from argparse import ArgumentParser, FileType

import datetime
from PyPDF2 import PdfFileReader
import sys

Now, the command line handler will accept a positional argument, which essentially represents the file path of the PDF file.

# ArgumentParser and FileType were imported by name above
parser = ArgumentParser('Metadata from PDF')
parser.add_argument('PDF_FILE', help='Path to PDF file', type=FileType('rb'))
args = parser.parse_args()

We can now use the getXmpMetadata() method to provide an object containing the available metadata, as shown below.

pdf_file = PdfFileReader(args.PDF_FILE)
xmpm = pdf_file.getXmpMetadata()

if xmpm is None:
    print("No XMP metadata found in document.")
    sys.exit()

We can use the custom_print() function to extract and print relevant values, such as title, creator, and contributors, as shown below.

custom_print("Title: {}", xmpm.dc_title)
custom_print("Creator(s): {}", xmpm.dc_creator)
custom_print("Contributors: {}", xmpm.dc_contributor)
custom_print("Subject: {}", xmpm.dc_subject)
custom_print("Description: {}", xmpm.dc_description)
custom_print("Created: {}", xmpm.xmp_createDate)
custom_print("Modified: {}", xmpm.xmp_modifyDate)
custom_print("Event Dates: {}", xmpm.dc_date)

Because PDFs created with different software may store these values as different types, custom_print() handles each type separately. The function is defined as shown below.

def custom_print(fmt_str, value):
    if isinstance(value, list):
        # Items may be strings or datetimes, so convert each one first
        print(fmt_str.format(", ".join(str(v) for v in value)))
    elif isinstance(value, dict):
        # Join each key/value pair, then print the joined pairs
        fmt_value = [":".join((k, v)) for k, v in value.items()]
        print(fmt_str.format(", ".join(fmt_value)))
    elif isinstance(value, (str, bool)):
        print(fmt_str.format(value))
    elif isinstance(value, bytes):
        print(fmt_str.format(value.decode()))
    elif isinstance(value, datetime.datetime):
        print(fmt_str.format(value.isoformat()))
    elif value is None:
        print(fmt_str.format("N/A"))
    else:
        print("warn: unhandled type {} found".format(type(value)))

We can also extract any other custom properties stored by the software as shown below.

if xmpm.custom_properties:
    print("Custom Properties:")

    # Indent each property with a tab
    for k, v in xmpm.custom_properties.items():
        print("\t{}: {}".format(k, v))

The above script will read a PDF document and print the metadata stored in the XMP format, including some custom properties stored by the software with which the PDF was created.
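
Since XMP is an XML packet embedded in the file, the same values can be read from a raw packet with the standard library. The sketch below is an illustration under stated assumptions: SAMPLE_XMP is a hand-written, hypothetical packet, xmp_values is a hypothetical helper, and only Dublin Core (dc:) properties stored as rdf:li items are handled.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical, hand-written packet used only for illustration
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt>
    <rdf:li xml:lang="x-default">Quarterly Report</rdf:li>
   </rdf:Alt></dc:title>
   <dc:creator><rdf:Seq>
    <rdf:li>A. Author</rdf:li><rdf:li>B. Writer</rdf:li>
   </rdf:Seq></dc:creator>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def xmp_values(packet, prop):
    """Return the text of every rdf:li item under the given dc: property."""
    root = ET.fromstring(packet)
    values = []
    for prop_el in root.iter("{%s}%s" % (DC_NS, prop)):
        for item in prop_el.iter("{%s}li" % RDF_NS):
            values.append(item.text)
    return values


if __name__ == "__main__":
    print("Title:", xmp_values(SAMPLE_XMP, "title"))
    print("Creator(s):", xmp_values(SAMPLE_XMP, "creator"))
```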

Windows Executable Files

Sometimes you may encounter a suspicious or unauthorized executable file. For investigative purposes, such a file can still be useful because of its embedded metadata, which can reveal properties such as its origin, purpose, vendor, and compilation date. With the help of the following Python script, we can obtain the compilation date, useful data from the headers, and the imported and exported symbols.

To do this, first install the third-party Python library pefile with the following command.

pip install pefile

Once you have successfully installed it, import the following libraries as shown below.

from __future__ import print_function

import argparse
from datetime import datetime
from pefile import PE

The command-line handler accepts a positional argument, the file path to the executable. You can also choose between detailed and simplified output by providing the optional verbose flag, as shown below.

parser = argparse.ArgumentParser('Metadata from executable file')
parser.add_argument("EXE_FILE", help="Path to exe file")
parser.add_argument("-v", "--verbose", help="Increase verbosity of output",
                    action='store_true', default=False)
args = parser.parse_args()

Now, we will load the input executable file using the PE class. We will also dump the executable data into a dictionary object using the dump_dict() method.

pe = PE(args.EXE_FILE)
ped = pe.dump_dict()

We can extract basic file metadata, such as embedded authorship, version, and compilation time, using the code shown below:

file_info = {}
# Version resources are optional, so FileInfo may be missing entirely
for entry in getattr(pe, 'FileInfo', []):
    # Newer pefile releases wrap the structures in an extra list
    structures = entry if isinstance(entry, list) else [entry]
    for structure in structures:
        if structure.Key == b'StringFileInfo':
            for s_table in structure.StringTable:
                for key, value in s_table.entries.items():
                    if value is None or len(value) == 0:
                        value = "Unknown"
                    file_info[key] = value

print("File Information: ")
print("==================")

for k, v in file_info.items():
    if isinstance(k, bytes):
        k = k.decode()
    if isinstance(v, bytes):
        v = v.decode()
    print("{}: {}".format(k, v))

# The dumped value ends with a bracketed date such as "[Mon Jan 01 ... UTC]"
comp_time = ped['FILE_HEADER']['TimeDateStamp']['Value']
comp_time = comp_time.split("[")[-1].strip("]")
time_stamp, timezone = comp_time.rsplit(" ", 1)
comp_time = datetime.strptime(time_stamp, "%a %b %d %H:%M:%S %Y")
print("Compiled on {} {}".format(comp_time, timezone.strip()))
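
The compilation time comes from the TimeDateStamp field of the COFF file header, which stores seconds since the Unix epoch. As a rough sketch of what pefile reads on our behalf, the code below pulls that field out of raw bytes with the struct module. The pe_compile_time and make_fake_pe names are hypothetical, and make_fake_pe builds a tiny synthetic buffer containing only the fields being read, not a valid executable.

```python
import struct
from datetime import datetime, timezone


def pe_compile_time(data):
    """Return the COFF TimeDateStamp of a PE image as an aware UTC datetime."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # After the signature: Machine (2 bytes), NumberOfSections (2),
    # TimeDateStamp (4), all little-endian
    _machine, _sections, stamp = struct.unpack_from("<HHI", data, pe_offset + 4)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)


def make_fake_pe(stamp):
    """Build a synthetic buffer holding only the fields read above."""
    dos_header = bytearray(0x40)
    dos_header[:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, 0x40)
    coff = b"PE\x00\x00" + struct.pack("<HHI", 0x14C, 0, stamp)
    return bytes(dos_header) + coff


if __name__ == "__main__":
    print(pe_compile_time(make_fake_pe(1500000000)).isoformat())
```

Keep in mind that this field is written by the linker and can be altered afterwards, so it is best treated as one indicator among several.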

We can extract useful data from the section headers as follows:

for section in ped['PE Sections']:
    print("Section '{}' at {}: {}/{} {}".format(
        section['Name']['Value'], hex(section['VirtualAddress']['Value']),
        section['Misc_VirtualSize']['Value'],
        section['SizeOfRawData']['Value'], section['MD5']))

Now, extract the list of imports and exports from the executable, as shown below.

if hasattr(pe, 'DIRECTORY_ENTRY_IMPORT'):
    print("\nImports: ")
    print("=========")

    for dir_entry in pe.DIRECTORY_ENTRY_IMPORT:
        dll = dir_entry.dll

        # Without --verbose, list only the DLL names on one line
        if not args.verbose:
            print(dll.decode(), end=", ")
            continue
        name_list = []

        for impts in dir_entry.imports:
            # Symbols imported by ordinal have no name
            if getattr(impts, "name", b"Unknown") is None:
                name = b"Unknown"
            else:
                name = getattr(impts, "name", b"Unknown")
            name_list.append([name.decode(), hex(impts.address)])
        name_fmt = ["{} ({})".format(x[0], x[1]) for x in name_list]
        print('- {}: {}'.format(dll.decode(), ", ".join(name_fmt)))
    if not args.verbose:
        print()

Now, use the following code to print the name and address of each export.

if hasattr(pe, 'DIRECTORY_ENTRY_EXPORT'):
    print("\nExports: ")
    print("=========")

    for sym in pe.DIRECTORY_ENTRY_EXPORT.symbols:
        # Symbols exported only by ordinal have no name
        name = sym.name.decode() if sym.name else "Unknown"
        print('- {}: {}'.format(name, hex(sym.address)))

The above script extracts basic metadata and header information from Windows executable files.

Office File Metadata

A great deal of everyday computing work is done in three MS Office applications: Word, PowerPoint, and Excel. The files they produce carry extensive metadata that can reveal interesting information about their authors and history.

Note that the metadata for the Office 2007 and later formats, Word (.docx), Excel (.xlsx), and PowerPoint (.pptx), is stored in XML files inside a ZIP archive. We can process these XML files with the help of the following Python script.

First, import the required libraries and define the command-line handler, as shown below.

from __future__ import print_function
from argparse import ArgumentParser
from datetime import datetime as dt
from xml.etree import ElementTree as etree

import zipfile

# ArgumentParser was imported by name above
parser = ArgumentParser('Office Document Metadata')
parser.add_argument("Office_File", help="Path to office file to read")
args = parser.parse_args()

Now, check whether the file is a ZIP file and raise an error if it is not. Then open the file and extract the key elements for processing using the following code:

if not zipfile.is_zipfile(args.Office_File):
    raise ValueError("{} is not a ZIP-based Office file".format(
        args.Office_File))
zfile = zipfile.ZipFile(args.Office_File)
core_xml = etree.fromstring(zfile.read('docProps/core.xml'))
app_xml = etree.fromstring(zfile.read('docProps/app.xml'))

Now, create a dictionary to start extracting metadata.

core_mapping = {
   'title': 'Title',
   'subject': 'Subject',
   'creator': 'Author(s)',
   'keywords': 'Keywords',
   'description': 'Description',
   'lastModifiedBy': 'Last Modified By',
   'modified': 'Modified Date',
   'created': 'Created Date',
   'category': 'Category',
   'contentStatus': 'Status',
   'revision': 'Revision'
}

Iterate over the child elements of the root to access each tag in the XML file:

# Iterating an element yields its children (getchildren() was removed in Python 3.9)
for element in core_xml:
    for key, title in core_mapping.items():
        if key in element.tag:
            if 'date' in title.lower():
                text = dt.strptime(element.text, "%Y-%m-%dT%H:%M:%SZ")
            else:
                text = element.text
            print("{}: {}".format(title, text))
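
The steps above can be tried without a real Office document. The sketch below builds an in-memory ZIP that contains only a hand-written docProps/core.xml and reads it back. CORE_XML holds hypothetical sample values, and make_sample_docx and read_core_properties are hypothetical helper names; a real core.xml carries more elements and attributes.

```python
import io
import zipfile
from xml.etree import ElementTree as etree

# Hypothetical core.xml content, written by hand for this example
CORE_XML = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/'
    'metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:dcterms="http://purl.org/dc/terms/">'
    '<dc:creator>Example Author</dc:creator>'
    '<cp:revision>3</cp:revision>'
    '<dcterms:created>2020-01-02T03:04:05Z</dcterms:created>'
    '</cp:coreProperties>'
)


def make_sample_docx():
    """Build an in-memory ZIP that contains only docProps/core.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docProps/core.xml", CORE_XML)
    buf.seek(0)
    return buf


def read_core_properties(fileobj):
    """Return {local tag name: text} for each child of core.xml."""
    with zipfile.ZipFile(fileobj) as zf:
        root = etree.fromstring(zf.read("docProps/core.xml"))
    # Tags look like "{namespace}creator"; keep only the part after "}"
    return {child.tag.split("}", 1)[-1]: child.text for child in root}


if __name__ == "__main__":
    for key, value in read_core_properties(make_sample_docx()).items():
        print("{}: {}".format(key, value))
```

Matching on the local tag name, rather than testing whether a key appears anywhere in the tag, avoids accidental hits on the namespace part of the tag.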

Similarly, do the same for the app.xml file that contains statistics about the file contents:

app_mapping = {
    'TotalTime': 'Edit Time (minutes)',
    'Pages': 'Page Count',
    'Words': 'Word Count',
    'Characters': 'Character Count',
    'Lines': 'Line Count',
    'Paragraphs': 'Paragraph Count',
    'Company': 'Company',
    'HyperlinkBase': 'Hyperlink Base',
    'Slides': 'Slide Count',
    'Notes': 'Note Count',
    'HiddenSlides': 'Hidden Slide Count',
}

# Iterating an element yields its children (getchildren() was removed in Python 3.9)
for element in app_xml:
    for key, title in app_mapping.items():
        if key in element.tag:
            if 'date' in title.lower():
                text = dt.strptime(element.text, "%Y-%m-%dT%H:%M:%SZ")
            else:
                text = element.text
            print("{}: {}".format(title, text))

Running the above script prints the core and application properties of a particular document. Note that the script only applies to documents in the Office 2007 or later formats.
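
The created and modified dates in core.xml end in Z, which denotes UTC, but strptime() with a literal Z returns a naive datetime. When timestamps from several sources are merged into one timeline, it may help to attach the zone explicitly. The following is a small sketch; parse_office_timestamp is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_office_timestamp(text):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware UTC datetime."""
    naive = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing Z means UTC, so attach that zone explicitly
    return naive.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    print(parse_office_timestamp("2020-01-02T03:04:05Z").isoformat())
```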
