Python Digital Forensics: Using Email for Investigation

Earlier chapters covered the importance, process, and concepts of network forensics. This chapter looks at the role of email in digital forensics and shows how to investigate email with Python.

The Role of Email in Investigations

Email plays a crucial role in business communications and has become one of the most important applications on the internet. It is a convenient way to send messages and files not only from computers but also from other electronic devices such as mobile phones and tablets.

The downside of email is that criminals can use it to leak valuable information about their companies. As a result, email has taken on a larger role in digital forensics in recent years. It is treated as key evidence, and email header analysis has become an essential technique for collecting evidence during an investigation.

Investigators have the following goals when conducting email forensics:

  • Identify the key offender
  • Collect necessary evidence
  • Present the findings
  • Build the case

Challenges of Email Forensics

Because so much communication today relies on email, email forensics is central to many investigations. However, investigators may face the following challenges.

Fake Emails

The biggest challenge in email forensics is fake email, which is created by manipulating and crafting headers. This category also includes temporary email services, which let a registered user receive mail at a temporary address that expires after a certain period of time.

Spoofing

Another challenge in email forensics is spoofing, where criminals present an email as coming from someone else. In this case, the receiving machine gets both the spoofed address and the original IP address.
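As a rough illustration, one simple spoofing indicator is a mismatch between the domain in the From header and the domain in the Return-Path header. The sketch below assumes that heuristic; the helper names address_domain and domains_mismatch and the sample message are hypothetical. A mismatch alone does not prove spoofing, since mailing lists and bulk senders legitimately use different envelope addresses.

```python
from email import message_from_string
from email.utils import parseaddr


def address_domain(header_value):
    """Return the lower-cased domain of the address in a header value."""
    _, addr = parseaddr(header_value or "")
    if "@" not in addr:
        return ""
    return addr.rpartition("@")[2].lower()


def domains_mismatch(msg):
    """True when From and Return-Path both exist but use different domains."""
    from_domain = address_domain(msg.get("From"))
    return_domain = address_domain(msg.get("Return-Path"))
    return bool(from_domain and return_domain and from_domain != return_domain)


# Hypothetical sample message used only for illustration
SAMPLE = (
    "Return-Path: <bounce@mailer.example.net>\n"
    "From: Alice <alice@example.com>\n"
    "To: bob@example.org\n"
    "Subject: Invoice\n"
    "\n"
    "Please see attached.\n"
)

msg = message_from_string(SAMPLE)
print(domains_mismatch(msg))
```

A flagged message is only a candidate for closer header analysis, not a conclusion.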

Anonymous Resending

In this case, the email server strips identifying information from the email before forwarding it further, which makes the message much harder to trace back to its sender.

Techniques Used in Email Forensic Investigations

Email forensics is the study of the source and content of emails as evidence to determine the actual sender and recipient of the message, as well as other information such as transmission date/time and the sender’s intent. It involves investigating metadata, port scanning, and keyword searches.

Some common techniques used in email forensic investigations are:

  • Header Analysis
  • Server Investigation
  • Network Device Investigation
  • Sender Email Fingerprinting
  • Software Embedded Identifiers
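To make header analysis more concrete, the following minimal sketch reads the Received headers of a message and lists the relay hops from origin to destination. It assumes that each relay prepends its own Received header, which is the usual convention; the sample message, the documentation-range IP addresses, and the helper names received_hops and hop_ips are made up for illustration. Received headers can themselves be forged, so treat the result as a lead rather than proof.

```python
import re
from email import message_from_string

# Hypothetical sample message with two relay hops
SAMPLE = (
    "Received: from mx.example.org (mx.example.org [203.0.113.7])\n"
    "\tby inbox.example.org; Tue, 02 Jan 2024 10:00:05 +0000\n"
    "Received: from sender-pc (unknown [198.51.100.23])\n"
    "\tby mx.example.org; Tue, 02 Jan 2024 10:00:01 +0000\n"
    "From: alice@example.com\n"
    "Subject: Test\n"
    "\n"
    "Body\n"
)


def received_hops(msg):
    """Return Received header values ordered from origin to destination."""
    hops = msg.get_all("Received", [])
    # Each relay prepends its header, so the last one is the earliest hop.
    # Collapse folded whitespace so every hop fits on one line.
    return [" ".join(hop.split()) for hop in reversed(hops)]


def hop_ips(hop):
    """Extract bracketed IPv4 addresses from one Received header value."""
    return re.findall(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]", hop)


msg = message_from_string(SAMPLE)
hops = received_hops(msg)
for number, hop in enumerate(hops, start=1):
    print("Hop {}: {} {}".format(number, hop_ips(hop), hop))
```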

In the following sections, we will learn how to use Python to obtain information for email investigation purposes.

Extracting Information from EML Files

EML is a file format widely used to store individual email messages. EML files are structured text files compatible with multiple email clients, such as Microsoft Outlook, Outlook Express, and Windows Live Mail.

EML files store email headers, body content, and attachments as plain text. Binary data is Base64-encoded, and text content may use Quoted-Printable (QP) encoding. Below is a Python script that can be used to extract information from EML files.

First, import the following Python libraries, as shown below.

from __future__ import print_function
from argparse import ArgumentParser, FileType
from email import message_from_file

import os
import quopri
import base64

Of the libraries above, quopri is used to decode QP-encoded values in the EML file, and base64 is used to decode any Base64-encoded data.
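Before using these two libraries in the script, it may help to see what each decoder does on its own. The short sketch below decodes one QP-encoded value and one Base64-encoded value; both sample strings are invented for illustration.

```python
import base64
import quopri

# Hypothetical encoded values of the kind found in EML bodies
qp_text = b"Caf=C3=A9 costs =E2=82=AC5"
b64_text = b"SGVsbG8sIGZvcmVuc2ljcyE="

# quopri turns each =XX escape back into the byte it represents
decoded_qp = quopri.decodestring(qp_text).decode("utf-8")

# base64 turns each group of four characters back into three bytes
decoded_b64 = base64.b64decode(b64_text).decode("ascii")

print(decoded_qp)
print(decoded_b64)
```

Both functions return bytes, so the result still has to be decoded with the right character set before it can be treated as text.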

Next, let’s provide arguments to the command line handler. Note that it only accepts a single argument: the path to the EML file, as shown below.

if __name__ == '__main__':
   parser = ArgumentParser('Extracting information from EML file')
   parser.add_argument("EML_FILE", help="Path to EML File", type=FileType('r'))
   args = parser.parse_args()
   main(args.EML_FILE)

Now, we need to define the main() function, in which we will use a method from the email library called message_from_file() to read a file-like object. Here, we will access the headers, body content, attachments, and other payload information by using a variable named emlfile as shown in the following code.

def main(input_file):
   emlfile = message_from_file(input_file)
   # Print every header in the order it appears in the file
   for key, value in emlfile._headers:
      print("{}: {}".format(key, value))
   print("\nBody\n")

   if emlfile.is_multipart():
      for part in emlfile.get_payload():
         process_payload(part)
   else:
      # A non-multipart message is its own single payload
      process_payload(emlfile)

Now, we need to define the process_payload() method, in which we use the get_payload() method to extract the message body content and the quopri.decodestring() function to decode QP-encoded data. We also check the MIME type of the content so that each payload is stored appropriately: HTML bodies and application attachments are written to files, and everything else is printed. Observe the code below:

def process_payload(payload):
   print(payload.get_content_type() + "\n" + "=" * len(payload.get_content_type()))
   body = quopri.decodestring(payload.get_payload())

   # get_content_charset() returns the charset name from Content-Type, or None
   charset = payload.get_content_charset()
   if charset:
      body = body.decode(charset)
   else:
      try:
         body = body.decode()
      except UnicodeDecodeError:
         body = body.decode('cp1252')

   if payload.get_content_type() == "text/html":
      # Save the HTML body next to the script, named after the EML file
      outfile = os.path.basename(args.EML_FILE.name) + ".html"
      open(outfile, 'w').write(body)
   elif payload.get_content_type().startswith('application'):
      # Attachments are Base64-encoded, so decode them before writing
      outfile = open(payload.get_filename(), 'wb')
      body = base64.b64decode(payload.get_payload())
      outfile.write(body)
      outfile.close()
      print("Exported: {}\n".format(outfile.name))
   else:
      print(body)

After executing the above script, we will get the header information and various payloads in the console.
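As an alternative sketch, not part of the script above, Python 3's email package can also parse a message with policy.default, which decodes transfer encodings and character sets automatically. The example below assumes that approach; the summarize helper and the sample message bytes are hypothetical.

```python
from email import message_from_bytes, policy

# Hypothetical EML content: one QP-encoded text part, one Base64 attachment
RAW = (
    b"From: alice@example.com\n"
    b"To: bob@example.org\n"
    b"Subject: Quarterly report\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\n'
    b"\n"
    b"--XYZ\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"Total due: =E2=82=AC5\n"
    b"--XYZ\n"
    b'Content-Type: application/octet-stream; name="notes.bin"\n'
    b'Content-Disposition: attachment; filename="notes.bin"\n'
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    b"AAEC\n"
    b"--XYZ--\n"
)


def summarize(raw_bytes):
    """Return (subject, body text, [(attachment name, attachment bytes)])."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # get_body() picks the preferred text part and skips attachments
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    attachments = [
        (part.get_filename(), part.get_content())
        for part in msg.iter_attachments()
    ]
    return str(msg["Subject"]), body, attachments


subject, body, attachments = summarize(RAW)
print(subject)
print(body.strip())
print(attachments)
```

With this API, get_content() returns decoded text for text parts and raw bytes for binary parts, so no manual quopri or base64 calls are needed.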

Analyzing MSG Files with Python

Email messages come in many different formats. MSG is the format used by Microsoft Outlook and Exchange. Files with the MSG extension may contain plain ASCII text headers and the main message body, as well as hyperlinks and attachments.

In this section, we will learn how to extract information from MSG files using the Outlook API. Please note that the following Python script only works on Windows. To do this, we need to install a third-party Python library called pywin32 as shown below.

pip install pywin32

Now, import the following libraries as shown:

from __future__ import print_function
from argparse import ArgumentParser

import os
import win32com.client
import pywintypes

Now, let’s provide an argument to the command line processor. Here, it will accept two arguments: the path to the MSG file and the desired output folder, as shown below.

if __name__ == '__main__':
   parser = ArgumentParser('Extracting information from MSG file')
   parser.add_argument("MSG_FILE", help="Path to MSG file")
   parser.add_argument("OUTPUT_DIR", help="Path to output folder")
   args = parser.parse_args()
   out_dir = args.OUTPUT_DIR

   # Create the output folder if it does not exist yet
   if not os.path.exists(out_dir):
      os.makedirs(out_dir)
   main(args.MSG_FILE, args.OUTPUT_DIR)

Now, we need to define the main() function, in which we use the win32com library to set up the Outlook API, which in turn allows access to the MAPI namespace.

def main(msg_file, output_dir):
   mapi = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
   # Use the msg_file parameter rather than the global args object
   msg = mapi.OpenSharedItem(os.path.abspath(msg_file))

   display_msg_attribs(msg)
   display_msg_recipients(msg)

   extract_msg_body(msg, output_dir)
   extract_attachments(msg, output_dir)

Now, define the various functions we’ll use in this script. The code given below shows the definition of the display_msg_attribs() function, which allows us to display various attributes of the message such as Subject, To, Bcc, Cc, Size, Sender’s Name, Sent, etc.

def display_msg_attribs(msg):
   attribs = [
      'Application', 'AutoForwarded', 'BCC', 'CC', 'Class',
      'ConversationID', 'ConversationTopic', 'CreationTime',
      'ExpiryTime', 'Importance', 'InternetCodePage', 'IsMarkedAsTask',
      'LastModificationTime', 'Links', 'ReceivedTime', 'ReminderSet',
      'ReminderTime', 'ReplyRecipientNames', 'Saved', 'Sender',
      'SenderEmailAddress', 'SenderEmailType', 'SenderName', 'Sent',
      'SentOn', 'SentOnBehalfOfName', 'Size', 'Subject',
      'TaskCompletedDate', 'TaskDueDate', 'To', 'UnRead'
   ]
   print("\nMessage Attributes")
   for entry in attribs:
      print("{}: {}".format(entry, getattr(msg, entry, 'N/A')))

Now, define the display_msg_recipients() function to iterate through the message and display the recipient details.

def display_msg_recipients(msg):
   recipient_attrib = ['Address', 'AutoResponse', 'Name', 'Resolved', 'Sendable']
   i = 1

   while True:
      try:
         recipient = msg.Recipients(i)
      except pywintypes.com_error:
         # No recipient at this index, so we have seen them all
         break
      print("\nRecipient {}".format(i))
      print("=" * 15)

      for entry in recipient_attrib:
         print("{}: {}".format(entry, getattr(recipient, entry, 'N/A')))
      i += 1

Next, we define the extract_msg_body() function to extract the body content from the email, including HTML and plain text.

def extract_msg_body(msg, out_dir):
   html_data = msg.HTMLBody.encode('cp1252')
   outfile = os.path.join(out_dir, os.path.basename(args.MSG_FILE))

   open(outfile + ".body.html", 'wb').write(html_data)
   print("Exported: {}".format(outfile + ".body.html"))
   body_data = msg.Body.encode('cp1252')

   open(outfile + ".body.txt", 'wb').write(body_data)
   print("Exported: {}".format(outfile + ".body.txt"))

Next, we will define the extract_attachments() function to export attachment data to the desired output directory.

def extract_attachments(msg, out_dir):
   attachment_attribs = ['DisplayName', 'FileName', 'PathName', 'Position', 'Size']
   i = 1 # Attachments start at 1

   while True:
      try:
         attachment = msg.Attachments(i)
      except pywintypes.com_error:
         break

Still inside the while loop, we print each attachment's attributes to the console and save the attachment to the output directory with the following lines of code:

      print("\nAttachment {}".format(i))
      print("=" * 15)

      for entry in attachment_attribs:
         print('{}: {}'.format(entry, getattr(attachment, entry, "N/A")))
      outfile = os.path.join(os.path.abspath(out_dir), os.path.split(args.MSG_FILE)[-1])

      if not os.path.exists(outfile):
         os.makedirs(outfile)
      outfile = os.path.join(outfile, attachment.FileName)
      attachment.SaveAsFile(outfile)

      print("Exported: {}".format(outfile))
      i += 1

After running the above script, we will get the properties of the message and its attachments in the console window, as well as several files in the output directory.

Building MBOX Files from Google Takeout Using Python

MBOX files are text files with a special format that stores messages in separate sections, one after another. They are often found in connection with UNIX systems, Thunderbird, and Google Takeout.
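To see that layout in practice, the sketch below builds a tiny MBOX file in a temporary directory with the standard mailbox module and then reads it back. The helper names build_sample_mbox and list_subjects, the file name, and the message contents are hypothetical. In an MBOX file, each message starts with a separator line beginning with "From ", which is how the sections are told apart.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def build_sample_mbox(path, subjects):
    """Write one small message per subject into a new mbox file."""
    box = mailbox.mbox(path)
    try:
        for subject in subjects:
            msg = EmailMessage()
            msg["From"] = "alice@example.com"
            msg["To"] = "bob@example.org"
            msg["Subject"] = subject
            msg.set_content("Body of " + subject)
            box.add(msg)
        box.flush()
    finally:
        box.close()


def list_subjects(path):
    """Return the Subject header of every message in the mbox file."""
    box = mailbox.mbox(path)
    try:
        return [message["Subject"] for message in box]
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.mbox")
    build_sample_mbox(sample_path, ["First", "Second"])
    subjects = list_subjects(sample_path)
    # Peek at the raw file to see the separator line of the first message
    with open(sample_path) as handle:
        first_line = handle.readline()

print(subjects)
print(first_line.startswith("From "))
```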

In this section, you will see a Python script that structures the MBOX files we receive from Google Takeout. But before that, we need to know how to generate these MBOX files from a Google Account or a Gmail account.

Getting Google Account Emails in MBOX Format

Getting Google Account emails means backing up a Gmail account, which can be useful for a variety of personal or professional reasons. Google offers this backup of Gmail data through Google Takeout. To get Google Account emails in MBOX format, follow these steps:

  • Open the My Account dashboard.
  • Go to the Personal Information & Privacy section and select the Control Your Content link.
  • Create a new archive or manage existing archives. Clicking the “Create Archive” link presents checkboxes for each Google product to include.
  • After selecting the products, choose the file type and maximum archive size, and select the delivery method from the list.
  • Finally, the backup is delivered in MBOX format.

Python Code

Now, the MBOX file discussed above can be structured in Python, as shown below.

First, we need to import Python libraries as shown below.

from __future__ import print_function
from argparse import ArgumentParser

import mailbox
import os
import time
import csv
from tqdm import tqdm

import base64

The mailbox library is used to parse MBOX files, csv writes the report, time helps build unique file names for exported content, and tqdm is a third-party library that displays a progress bar. The remaining libraries were used in the previous scripts.

Now, provide arguments to the command-line handler. It accepts two arguments: the path to the MBOX file and the desired output folder.

if __name__ == '__main__':
   parser = ArgumentParser('Parsing MBOX files')
   parser.add_argument("MBOX", help="Path to mbox file")
   parser.add_argument(
      "OUTPUT_DIR", help="Path to output directory to write report "
      "and exported content")
   args = parser.parse_args()
   main(args.MBOX, args.OUTPUT_DIR)

Now, we will define the main() function and call the mbox class from the mailbox library. With its help, we can parse the MBOX file by providing its path.

def main(mbox_file, output_dir):
   print("Reading mbox file")
   mbox = mailbox.mbox(mbox_file, factory=custom_reader)
   print("{} messages to parse".format(len(mbox)))

Now, define a reader method for the mailbox library as shown below.

def custom_reader(data_stream):
   data = data_stream.read()
   try:
      content = data.decode("ascii")
   except (UnicodeDecodeError, UnicodeEncodeError) as e:
      # Fall back to cp1252 and replace any bytes that still cannot be decoded
      content = data.decode("cp1252", errors="replace")
   return mailbox.mboxMessage(content)

Now, still inside the main() function, create some variables for further processing as shown below.

   parsed_data = []
   attachments_dir = os.path.join(output_dir, "attachments")

   if not os.path.exists(attachments_dir):
      os.makedirs(attachments_dir)
   columns = [
      "Date", "From", "To", "Subject", "X-Gmail-Labels", "Return-Path", "Received",
      "Content-Type", "Message-ID", "X-GM-THRID", "num_attachments_exported", "export_path"]

Next, use tqdm to generate a progress bar and track the iterations, as shown below.

   for message in tqdm(mbox):
      msg_data = dict()
      header_data = dict(message._headers)
      for hdr in columns:
         msg_data[hdr] = header_data.get(hdr, "N/A")

Now, we check if the message has a payload. If so, we call the write_payload() method, which is defined later, as shown below.

      if len(message.get_payload()):
         export_path = write_payload(message, attachments_dir)
         msg_data['num_attachments_exported'] = len(export_path)
         msg_data['export_path'] = ", ".join(export_path)

Now, we append each message's data to the list inside the loop. Once the loop finishes, we call the create_report() method. The write_payload() function, which walks multipart messages recursively and exports each part according to its content type, is defined right after, as shown below.

      parsed_data.append(msg_data)
   create_report(
      parsed_data, os.path.join(output_dir, "mbox_report.csv"), columns)

def write_payload(msg, out_dir):
   pyld = msg.get_payload()
   export_path = []

   if msg.is_multipart():
      # Recurse into each sub-part of a multipart message
      for entry in pyld:
         export_path += write_payload(entry, out_dir)
   else:
      content_type = msg.get_content_type()
      if "application/" in content_type.lower():
         content = base64.b64decode(msg.get_payload())
         export_path.append(export_content(msg, out_dir, content))
      elif "image/" in content_type.lower():
         content = base64.b64decode(msg.get_payload())
         export_path.append(export_content(msg, out_dir, content))
      elif "video/" in content_type.lower():
         content = base64.b64decode(msg.get_payload())
         export_path.append(export_content(msg, out_dir, content))
      elif "audio/" in content_type.lower():
         content = base64.b64decode(msg.get_payload())
         export_path.append(export_content(msg, out_dir, content))
      elif "text/csv" in content_type.lower():
         content = base64.b64decode(msg.get_payload())
         export_path.append(export_content(msg, out_dir, content))
      elif "info/" in content_type.lower():
         export_path.append(export_content(msg, out_dir,
            msg.get_payload()))
      elif "text/calendar" in content_type.lower():
         export_path.append(export_content(msg, out_dir,
            msg.get_payload()))
      elif "text/rtf" in content_type.lower():
         export_path.append(export_content(msg, out_dir,
            msg.get_payload()))
      else:
         # Any other part is exported only if it carries a file name
         if "name=" in msg.get('Content-Disposition', "N/A"):
            content = base64.b64decode(msg.get_payload())
            export_path.append(export_content(msg, out_dir, content))
         elif "name=" in msg.get('Content-Type', "N/A"):
            content = base64.b64decode(msg.get_payload())
            export_path.append(export_content(msg, out_dir, content))
   return export_path

The if-else chain above selects how each part is decoded based on its content type. Now, we need to define export_content(), which gets the file name from the msg object and makes it unique by adding a timestamp, as shown below.

def export_content(msg, out_dir, content_data):
   file_name = get_filename(msg)
   file_ext = "FILE"

   if "." in file_name:
      file_ext = file_name.rsplit(".", 1)[-1]
   file_name = "{}_{:.4f}.{}".format(file_name.rsplit(".", 1)[0], time.time(), file_ext)
   file_name = os.path.join(out_dir, file_name)

Now, the following lines of code, which complete export_content(), actually export the file:

   if isinstance(content_data, str):
      open(file_name, 'w').write(content_data)
   else:
      open(file_name, 'wb').write(content_data)
   return file_name

Now, let us define a function that extracts the file names from the message so that the exported files are named accurately, as shown below.

def get_filename(msg):
   if 'name=' in msg.get("Content-Disposition", "N/A"):
      fname_data = msg["Content-Disposition"].replace("\r\n", " ")
      fname = [x for x in fname_data.split("; ") if 'name=' in x]
      file_name = fname[0].split("=", 1)[-1]
   elif 'name=' in msg.get("Content-Type", "N/A"):
      fname_data = msg["Content-Type"].replace("\r\n", " ")
      fname = [x for x in fname_data.split("; ") if 'name=' in x]
      file_name = fname[0].split("=", 1)[-1]
   else:
      file_name = "NO_FILENAME"
   # Keep only letters, digits, whitespace, and dots in the final name
   fchars = [x for x in file_name if x.isalnum() or x.isspace() or x == "."]
   return "".join(fchars)
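For comparison, the standard library's Message.get_filename() method already reads the filename parameter of Content-Disposition and falls back to the name parameter of Content-Type. The sketch below is an assumption-laden alternative, not the script's own approach: the safe_filename helper and the sample part are hypothetical, and stripping leading dots is an extra precaution added here.

```python
from email import message_from_string

# Hypothetical attachment part with an awkward file name
PART = (
    'Content-Type: application/pdf; name="fallback.pdf"\n'
    'Content-Disposition: attachment; filename="../Q3 report (final).pdf"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "AAEC\n"
)


def safe_filename(part, default="NO_FILENAME"):
    """Return a sanitized attachment name, or default when none is present."""
    raw_name = part.get_filename()
    if not raw_name:
        return default
    # Keep only letters, digits, whitespace, and dots
    kept = [ch for ch in raw_name if ch.isalnum() or ch.isspace() or ch == "."]
    # Drop leading dots so "../x" cannot turn into a hidden or relative name
    return "".join(kept).lstrip(".") or default


part = message_from_string(PART)
print(safe_filename(part))
```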

Now, we can write a CSV file by defining the create_report() function as shown below.

def create_report(output_data, output_file, columns):
   with open(output_file, 'w', newline="") as outfile:
      csvfile = csv.DictWriter(outfile, columns)
      csvfile.writeheader()
      csvfile.writerows(output_data)
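To check how csv.DictWriter lays out such a report without touching the disk, the following sketch writes two made-up rows into an in-memory buffer. The render_report helper, the shortened column list, and the sample rows are hypothetical; the restval and extrasaction arguments are optional additions that fill in missing columns and ignore unexpected keys.

```python
import csv
import io

# Shortened, hypothetical column list and rows for illustration
COLUMNS = ["Date", "From", "Subject", "num_attachments_exported"]
rows = [
    {"Date": "Tue, 02 Jan 2024 10:00:00 +0000", "From": "alice@example.com",
     "Subject": "Invoice, overdue", "num_attachments_exported": 1},
    {"From": "bob@example.org", "Subject": "Hello"},
]


def render_report(rows, columns):
    """Return the CSV report as a string instead of writing a file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, columns, restval="N/A", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


report = render_report(rows, COLUMNS)
print(report)
```

Values that contain commas, such as the first subject, are quoted automatically, so the report can be read back with csv.DictReader.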

After running the script above, we get a CSV report and a directory full of attachments.
