Stealing a person's identity is one of the most profitable crimes committed by criminals. Among 1.3 million complaints received by the Federal Trade Commission in 2009, identity theft ranked first and accounted for 21% of the complaints costing consumers over 1.7 billion US dollars . Identity theft has been around for many years while the means of committing it has changed with technology. The traditional way criminals steal a person's identity is by killing the individual. Another way to steal identity is using phone scams, where in, criminals inform the person that they have won a sweepstake, and convince the user to reveal some personal information to claim the money. The more popular method of identity theft that is prevalent even today is called Dumpster Diving. When people discard letters, financial records, and other personal information in the garbage dump without shredding, criminals scavenge those dumps looking for sensitive information such as credit card, bank account social security numbers, and use that information to commit crimes.
With the advent of Internet, the most popular way to steal identity is through "phishing". Like in traditional fishing where fishermen trolls the river in a boat to catch fish, in "phishing", attackers trolls the Internet using email message with convincing content as baits to steal users personal information. The email directs the user via a hyperlink to a website owned by criminals that looks very similar to a legitimate website. The user will then be asked to enter personal and financial information either to update existing information or to purchase a product. In reality, this lets the criminal to have access to that valuable information which they then use to commit fraud or to sell it to a bidder. Phishers can also trick users into downloading malicious codes or malware after they click on a link embedded in the email. This is a useful tool in crimes like economic espionage where sensitive internal communications can be accessed and trade secrets stolen. Phishing has been around since 1996 but has become more common and more sophisticated. Recent phishing attack on the Gmail system stole emails of US government officials, contractors, and military personnel .
Considerable research has been done towards protecting users from phishing attacks. They include firewalls, black listing certain domains and Internet protocol (IP) addresses, spam filtering techniques, client side toolbars, and user education. Each of these existing techniques has some advantages and some disadvantages. For example, existing filters have misclassification rates, the blacklist approach is harder to maintain with every expanding IP address/domain space, while the user ignores client side toolbar warnings.
The main contribution of this research is a multi-layered phishing detection method using previously developed modeling techniques that includes topic modeling technique Probabilistic Latent Semantic Analysis (PLSA), classifier ensemble technique AdaBoost and Co-Training algorithm that employs labeled and unlabeled data. The main goal of our novel approach is to detect phishing before it gets to the user. Towards that goal, we have developed the detection method, called phishGILLNET, by incorporating the power of natural language processing techniques. Similar to a "gillnet" that catches fish by its gill thus preventing its movement once caught, phishGILLNET tries to catch phishing attacks by the tone, wordings, and other linguistic variations in the content. By serving as a server side filter, phishGILLNET prevents movement of a phish towards the end user. The first layer of phishGILLNET (phishGILLNET1) employs PLSA to build a topic model and uses a topic level similarity function for classification. Unlike earlier approach that employed topic models, our model employed editing function and dictionary lookups to specifically account for intentionally misspelled words in phishing emails. The second layer of phishGILLNET (phishGILLNET2) employs classifier ensemble technique AdaBoost and topic probabilities as features to build a robust classifier using several base learners. To further expand phishGILLNET to handle labeled and unlabeled email data, the third layer (phishGILLNET3) employs Co-Training to build a classifier using topic distributions as features and the best classification technique obtained in the second layer. To the best of the authors' knowledge, this is the first attempt that demonstrates the power of topic model using Co-Training for phishing detection. The size of the corpus we employed is significantly larger (approximately 400,000) than that employed by authors of the Co-Training technique (few thousands) as well as by earlier researchers. Thus, our research is an additional proof of concept of the Co-Training algorithm in employing unlabeled data.
This article is organized as follows. We first review the state-of-the-art protection techniques and present their advantages and disadvantages (see Section 2). The multi-layered phishing detection method phishGILLNET is presented in Section 3. The modeling techniques employed by phishGILLNET namely PLSA, AdaBoost, and Co-Training are described in Sections 4, 5, and 6, respectively. The experimental design is presented in Section 7. The architectural components and results obtained on the public corpus for each layer of phishGILLNET, namely, phishGILLNET1, phishGILLNET2, and phishGILLNET3, are presented in Sections 8, 9, and 10, respectively. The performance comparison with the state-of-the-art tools is presented in Section 11. This article concludes with a discussion of the developed methodology and suggestions for future research in Section 12.