aaron.harnly.net

Email Thread Reassembly Using Similarity Matching

J Yeh and A Harnly, 2006. Email Thread Reassembly Using Similarity Matching. Conference on Email and Anti-Spam (CEAS).

Email thread reassembly is the task of linking messages by parent-child relationships. In this paper, we present two approaches to address this problem. One exploits previously undocumented header information from the Microsoft Exchange Protocol. The other uses string similarity metrics and a heuristic algorithm to reassemble threads in the absence of header information. The pros and cons of both methods are discussed. The similarity matching method is evaluated using the Enron email corpus and found to perform well.

Full Text

1. INTRODUCTION

One key difference between emails and other types of documents is the existence of threading, i.e. hierarchical, referential relationships among emails. Recently, email thread structure has been profitably employed in several applications, including email search [3], email summarization [9], email classification [1], and visualization [5]. However, the lack of reliable, widely applicable methods for thread reassembly has limited the use of thread structure.

Email thread reassembly is the task of relating messages by parent-child relationships, grouping messages together based on which messages are replies to which others. In many cases, this task can be achieved based on specific data within email headers. However, no standard protocol for thread structure headers is universally observed, making thread reassembly difficult in the general case.
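As an illustration of header-based linking, the following is a minimal sketch that assumes each message carries the Message-ID and In-Reply-To fields defined in RFC 5322. These fields are not named in the text above, and the function name `link_by_headers` is hypothetical. The sketch maps each message to its parent and leaves a message unlinked when the header is absent or points outside the collection.

```python
from email import message_from_string


def link_by_headers(raw_messages):
    """Map each Message-ID to its parent's Message-ID, or None if unknown.

    Assumption: messages are raw RFC 5322 strings that may or may not
    carry an In-Reply-To header.
    """
    parsed = [message_from_string(raw) for raw in raw_messages]
    known_ids = {m["Message-ID"].strip() for m in parsed if m["Message-ID"]}

    parents = {}
    for m in parsed:
        msg_id = m["Message-ID"]
        if not msg_id:
            continue  # a message without an identifier cannot be placed
        parent = (m["In-Reply-To"] or "").strip() or None
        # Keep the link only if the parent is actually in the collection.
        parents[msg_id.strip()] = parent if parent in known_ids else None
    return parents
```

A reply whose client omitted In-Reply-To ends up with no parent here, which is the situation that motivates reassembly without header information.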

In this paper, we present two approaches to threading email messages. The first employs a specific header, “Thread-Index,” which is defined in the Microsoft Exchange Protocol. The second links two messages mainly by measuring the content similarity between them, and also takes into account several heuristics, such as subject, time, and sender/recipient relationships among emails. Furthermore, since some messages in a thread may not exist in the corpus (e.g., if deleted), we also discuss how to recover missing messages. Here, a missing message, as defined in [2], is an email that is not itself present in the archive but has been quoted in subsequent emails kept in a user’s folder.

The contributions of this work are twofold. First, this paper offers a method of thread reassembly in the absence of header information. Second, we evaluate the method in a case study with the Enron corpus. In the following, Section 2 introduces previous related work. In Sections 3-4, we describe the proposed methods for the email thread reassembly task. Preliminary results and discussion are given in Sections 5 and 6.
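To make the second approach concrete, here is a rough sketch of similarity-based parent selection. It is not the algorithm from the paper: the similarity measure (Python's `difflib.SequenceMatcher`), the `Msg` record, the `pick_parent` function, and the 0.3 threshold are all assumptions chosen for illustration. It only shows how content similarity can be combined with subject, time, and sender/recipient checks.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from difflib import SequenceMatcher


@dataclass
class Msg:  # hypothetical record type, not from the paper
    id: str
    sender: str
    recipients: frozenset
    subject: str
    sent: datetime
    body: str


def base_subject(subject):
    """Strip leading reply/forward prefixes and lowercase the remainder."""
    stripped = re.sub(r"^\s*((re|fw|fwd)\s*:\s*)+", "", subject, flags=re.I)
    return stripped.strip().lower()


def pick_parent(child, candidates, threshold=0.3):
    """Return the most similar plausible parent of `child`, or None."""
    best, best_score = None, threshold
    for cand in candidates:
        # Time heuristic: a parent must have been sent before the child.
        if cand.id == child.id or cand.sent >= child.sent:
            continue
        # Subject heuristic: subjects must agree once prefixes are removed.
        if base_subject(cand.subject) != base_subject(child.subject):
            continue
        # Sender/recipient heuristic: the child's author received the
        # candidate, or is following up on their own message.
        if child.sender not in cand.recipients and cand.sender != child.sender:
            continue
        # Content similarity: replies usually quote part of their parent.
        score = SequenceMatcher(None, cand.body, child.body).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best
```

Calling `pick_parent` for every message yields a set of parent links that forms the thread tree. The heuristics act as hard filters here purely for simplicity; they could equally be folded into the score as weights.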
