MailHeaderDump Specification

John A. Thywissen, 17 Mar 2006

Purpose

The MailHeaderDump Specification describes the function of the MailHeaderDump software application product. The specification is intended for end users of the product and people interested in the methods used to gather the data that the product captures.

Scope

This specification applies to version 1.0 of the MailHeaderDump.jar Java archive (JAR) file.

MailHeaderDump collects attributes of mail messages that reside in a user's server-based message store for later analysis. The product does not collect data on a local message store, nor does it "intercept" all traffic flowing through a mailbox. It only takes a "snapshot" of the state of a message store at a particular time. It also does not perform any analysis.

The product is intended to be used as a data collection tool in a study of e-mail behavior.

This product was developed for the EMIS 7370 course, Probability and Statistics for Scientists and Engineers, at S.M.U., within the M.S.E.M. program.

Overview

MailHeaderDump performs one basic function: it connects to a server-based message store, traverses the store's structure, and records attributes of all messages located in the store to a data file.

The product has four interfaces:

The user interface: To gather operation parameters from the end user
The message store access protocol: To access and retrieve messages
The file output format: To record the observed messages' attributes
The platform API: To provide file I/O and preferences persistence services

User Interface

The user interface gathers operation parameters from the end user. It is in the form of a main window (Java JFrame) with the following components:

Version label: Static text identifying the application and its version.
"Mail server URL" text box: A mail store URL, in the form of "imap://username@mail.example.com". "imap" is the mail server access protocol, either "imap" or "pop". "username" is the username string used to access the message store on the mail server. "mail.example.com" is the hostname of the server on which the message store resides.
"Output file prefix" text box: A file name or file path to prefix to the file name of the output file. MailHeaderDump will add the username, a date-time group, and a ".txt" suffix to the prefix given here. File paths can be specified here, for example "/Users/username/Desktop/MHD-" (UNIX-style) or "D:\Documents and Settings\username\My Documents\MHD-" (Windows-style).
Three "Hash" checkboxes, labeled "Hash addresses", "Hash subjects", and "Hash message IDs": Selecting one of these checkboxes results in that mail message field being obscured by a hash function. The hash function is described below. This preserves the user's privacy, if desired.
"Start" button: Pressing this button causes MailHeaderDump to start its operation (detailed below).

During the connection to the message store, an authentication window will open and request the user's username and password for the message store server.
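The following is a minimal sketch of how the two text boxes above might be interpreted. It is not taken from the MailHeaderDump source. The class StartParameters and its members are hypothetical names, java.net.URI stands in for whatever URL parsing the product actually uses, and the file name layout follows the description under Product Functions (prefix, username, hyphen, yyyyMMddHHmmss, ".txt"). The time zone of the date-time group is not stated in this specification; the sketch assumes the platform default.

```java
import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper: splits the mail server URL and builds the output file name.
final class StartParameters {
    final String protocol;  // access protocol, e.g. "imap"
    final String username;  // account name, e.g. "username"
    final String host;      // server host name, e.g. "mail.example.com"

    StartParameters(String mailServerUrl) {
        // URI.create throws IllegalArgumentException on malformed input.
        URI uri = URI.create(mailServerUrl);
        protocol = uri.getScheme();
        username = uri.getUserInfo();
        host = uri.getHost();
    }

    // prefix + username + "-" + yyyyMMddHHmmss + ".txt"
    // Assumption: the timestamp uses the platform default time zone.
    String outputFileName(String prefix, Date now) {
        String stamp = new SimpleDateFormat("yyyyMMddHHmmss").format(now);
        return prefix + username + "-" + stamp + ".txt";
    }

    public static void main(String[] args) {
        StartParameters p = new StartParameters("imap://username@mail.example.com");
        System.out.println(p.protocol + " " + p.username + " " + p.host);
        System.out.println(p.outputFileName("MHD-", new Date()));
    }
}
```

Because the prefix is simply prepended, a prefix such as "/Users/username/Desktop/MHD-" selects both the directory and the leading part of the file name.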

Message Store Access Protocol

MailHeaderDump 1.0 incorporates JavaMail 1.3.3, and relies upon it to provide message store access services. JavaMail 1.3.3 provides access to message stores via the IMAP4, IMAP4rev1, and POP3 protocols. Note, however, that MailHeaderDump 1.0 has only been tested using the IMAP protocol. For more information on JavaMail, see http://java.sun.com/products/javamail/.

File Output Format

MailHeaderDump writes a single text file, with one message per line, and the TAB character used as a field separator. This file format is intended to be easy to import into Microsoft Excel. The file is written using the platform default encoding and the platform default line separator. This corresponds to Excel's behavior.
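As an illustration of this layout, the sketch below formats one record as TAB-separated fields terminated by the platform line separator. The class RecordWriter is a hypothetical name, and the code is written for a current Java release rather than Java SE 1.4. The specification does not say how TAB or line-break characters inside a field are handled; replacing them with a space is an assumption made here so that each message stays on one line.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: renders one message record as a single text line.
final class RecordWriter {
    static String formatRecord(List<String> fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append('\t');  // TAB is the field separator
            }
            String field = fields.get(i);
            // Assumption: missing values become empty fields, and embedded
            // TAB/CR/LF characters are replaced by a space.
            line.append(field == null ? "" : field.replaceAll("[\\t\\r\\n]", " "));
        }
        // Platform default line separator, as described above.
        return line.append(System.lineSeparator()).toString();
    }

    public static void main(String[] args) {
        System.out.print(formatRecord(Arrays.asList("msg-id", "2006-03-17 12:00:00", "1")));
    }
}
```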

Each line in the file contains the following fields:

Message-ID, or the hash of the Message-ID.
Sent date, in yyyy-MM-dd HH:mm:ss format, using UTC (coordinated universal time) time zone.
Received date, in yyyy-MM-dd HH:mm:ss format, using UTC (coordinated universal time) time zone.
Count of "From" addresses (normally one).
Count of "From" addresses that are in the same domain as the mail store. For example, if the mail store is on a server "mail.example.com", this field counts addresses ending in "example.com". This compares only the first- and second-level domains, that is, the last two labels of each host name. For example, "a.b.c.d" matches "x.y.c.d".
A comma-separated list of "From" addresses (RFC 2822 format), or a hash of each address.
Count of "To" addresses.
Count of "To" addresses that are in the same domain as the mail store. This only compares first and second level domains.
A comma-separated list of "To" addresses (RFC 2822 format), or a hash of each address.
Count of "Cc" addresses.
Count of "Cc" addresses that are in the same domain as the mail store. This only compares first and second level domains.
A comma-separated list of "Cc" addresses (RFC 2822 format), or a hash of each address.
Count of "Reply-to" addresses (normally one). If a "Reply-to" address is not supplied in the message, the "From" address is used.
Count of "Reply-to" addresses that are in the same domain as the mail store. This only compares first and second level domains.
A comma-separated list of "Reply-to" addresses (RFC 2822 format), or a hash of each address.
Subject, or hash of the Subject, with all "Re:", "Fwd:", and "FW:" prefixes stripped.
Host name of the message store server.
Username of this message store.
Full name of the folder, including the path to the folder for folders inside folders.
Seen flag, set on messages that have been read.
Answered flag, set on messages that have been replied to. Note: this is not used by all message store servers.
Flagged flag, set on messages flagged for special handling. Note: this is not used by all message store servers.
Draft flag, set on unsent drafts of outbound messages. Note: this is not used by all message store servers.
Size of the message in bytes.
"Importance" field, if set in the message.
"X-MS-Has-Attach" field, which is used by some message store servers to indicate if the message contains an attachment.
An empty field, for future use.
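Three of the field rules above (UTC date format, same-domain matching, and subject prefix stripping) can be sketched as follows. This is an illustration under stated assumptions, not the product's code. The class FieldFormats and its method names are hypothetical, addresses are assumed to be passed as bare "user@host" strings, and prefix matching is assumed to be case-insensitive, which the specification does not state.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Pattern;

// Hypothetical helpers for individual output fields.
final class FieldFormats {
    // Sent and received dates: yyyy-MM-dd HH:mm:ss in UTC.
    static String utcTimestamp(Date date) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    // Keeps only the last two labels of a host name, e.g. "x.y.c.d" -> "c.d".
    static String lastTwoLabels(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] labels = lower.split("\\.");
        int n = labels.length;
        return n < 2 ? lower : labels[n - 2] + "." + labels[n - 1];
    }

    // True when the address and the store host share their last two labels.
    // Assumption: address is a bare "user@host" string without a display name.
    static boolean sameDomain(String address, String storeHost) {
        int at = address.lastIndexOf('@');
        if (at < 0) {
            return false;
        }
        return lastTwoLabels(address.substring(at + 1)).equals(lastTwoLabels(storeHost));
    }

    // Assumption: prefixes are matched case-insensitively and may repeat.
    private static final Pattern PREFIXES =
            Pattern.compile("^(?:\\s*(?:re|fwd|fw):\\s*)+", Pattern.CASE_INSENSITIVE);

    static String stripPrefixes(String subject) {
        return PREFIXES.matcher(subject).replaceFirst("").trim();
    }

    public static void main(String[] args) {
        System.out.println(utcTimestamp(new Date()));
        System.out.println(sameDomain("user@a.b.c.d", "x.y.c.d"));
        System.out.println(stripPrefixes("Re: FW: Budget"));
    }
}
```

Comparing only two labels is simple, but it treats unrelated organizations under a shared two-label suffix as the same domain, which is a limitation to keep in mind when analyzing the counts.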

The "hash" of data in a field is produced via the process described in Hash Function below.

Product Functions

MailHeaderDump performs the following actions when executed:

Initialize the debug logging facility.
Initialize the user interface facility.
Read the previous state of the user interface selections from the platform's preferences.
Present the main window to the user.
Wait for the user to set the parameters in the main window and press the Start button.
Disable any further user input to the main window.
Save the state of the user interface selections to the platform's preferences.
Open the output file. The file name is the user supplied prefix, plus the username, plus a hyphen, plus the current date and time in the form yyyyMMddHHmmss, plus ".txt".
Connect to the server-based message store specified by the user in the mail server URL. During the opening of this connection, an authentication window is presented to request a username and password to use to gain access to the message store.
Starting with the message store's default folder, traverse the hierarchy of folders below it (depth-first traversal). For each folder, perform the following:
If the folder name is "Calendar", "Contacts", or "Tasks", skip it.
Iterate across all messages in the folder, and write the message attributes to a record in the output file. See File Output Format for the attributes captured.
Move to the next folder, repeating the two preceding steps (the folder name check and the message iteration), until all folders have been processed.
Close the connection to the message store.
Close the output file.
Close the main window.
Exit.
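The folder traversal in the steps above can be sketched as a recursive depth-first walk. The sketch uses a small in-memory tree (the hypothetical classes FolderWalk and Node) in place of a real message store, so that it runs without JavaMail. Two details are assumptions, since the specification does not settle them: a skipped folder's subfolders are skipped with it, and "/" is used as the path separator in the full folder name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for traversing a message store's folder hierarchy.
final class FolderWalk {
    // Minimal folder model: a name, some message identifiers, and subfolders.
    static final class Node {
        final String name;
        final List<String> messages = new ArrayList<>();
        final List<Node> children = new ArrayList<>();

        Node(String name) { this.name = name; }
        Node add(Node child) { children.add(child); return this; }
        Node msg(String id) { messages.add(id); return this; }
    }

    // Folder names that are not mail folders and are therefore skipped.
    static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("Calendar", "Contacts", "Tasks"));

    // Visits every folder below 'folder' depth-first, appending one
    // "folderPath<TAB>messageId" entry per message to 'out'.
    static void walk(Node folder, String path, List<String> out) {
        for (Node child : folder.children) {
            if (SKIPPED.contains(child.name)) {
                continue;  // assumption: the whole subtree is skipped
            }
            String childPath = path.isEmpty() ? child.name : path + "/" + child.name;
            for (String messageId : child.messages) {
                out.add(childPath + "\t" + messageId);
            }
            walk(child, childPath, out);  // descend before moving to the next sibling
        }
    }

    public static void main(String[] args) {
        Node root = new Node("")
                .add(new Node("INBOX").msg("m1").add(new Node("Projects").msg("m2")))
                .add(new Node("Calendar").msg("c1"))
                .add(new Node("Sent").msg("m3"));
        List<String> out = new ArrayList<>();
        walk(root, "", out);
        for (String entry : out) {
            System.out.println(entry);
        }
    }
}
```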

Hash Function

A "hash" or message digest encodes a string into a "fingerprint" that is nearly unique for the input byte sequence, but does not reveal the contents of the input. For example, "secret" might always digest into "123", but there is no way to decode "123" to "secret". However, if one knows that two input strings resulted in the same fingerprint (both "123"), then they are very probably the same input string.

MailHeaderDump can output several fields in a hashed form, at the user's option. These are processed as follows:

For addresses, only the addr-spec address (part in angle brackets, ignoring recipient name) is used.
The case is normalized (upper case for subjects, lower case for addresses and message-IDs).
The resulting character string is rendered in the UTF-8 encoding and processed using the MD5 message digest algorithm.
The resulting 128-bit message digest is treated as an unsigned integer and output in base 32, using digits 0-9 and a-v.
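The steps above can be sketched with standard Java classes. This is an illustration of the described process, not the product's source. The class FieldHash and its method names are hypothetical, the use of Locale.ROOT for case normalization is an assumption, and the caller is assumed to have already extracted the addr-spec from an address. BigInteger.toString(32) produces the digits 0-9 and a-v and drops leading zeros; whether the product pads the result to a fixed width is not stated in this specification.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: normalize case, encode as UTF-8, MD5, print in base 32.
final class FieldHash {
    // Hashes an already-normalized string.
    static String hash(String normalized) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalized.getBytes(StandardCharsets.UTF_8));
            // Signum 1 makes BigInteger treat the 16 bytes as an unsigned value.
            return new BigInteger(1, digest).toString(32);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Addresses and message IDs are lower-cased before hashing.
    static String hashAddress(String addrSpec) {
        return hash(addrSpec.toLowerCase(Locale.ROOT));
    }

    // Subjects are upper-cased before hashing.
    static String hashSubject(String subject) {
        return hash(subject.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(hashAddress("Alice@Example.com"));
        System.out.println(hashAddress("alice@example.com"));  // same fingerprint as above
    }
}
```

A 128-bit value needs at most 26 base-32 digits, so hashed fields stay short and contain only characters that import cleanly into a spreadsheet.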

Assumptions and Dependencies

Assumption: Mail messages of interest reside in a server-based message store, accessible via the IMAP4 protocol.
Assumption: Output file is to be read into Microsoft Excel.
Assumption: The 26 data elements noted in File Output Format, above, are sufficient.
Assumption: The connection between the user and the mail store server runs over a secured channel, such as a VPN tunnel or a physically secure network link, because mail data and authentication credentials are transmitted in the clear.
Dependency: Java SE, version 1.4, installed in user environment.
Dependency: JavaMail 1.3.3 (and therefore JAF 1.0.2), bundled in MailHeaderDump.jar.