RESEARCH PRODUCT

Spam classification for online discussions

Hao Wu

subject

ComputingMethodologies_PATTERNRECOGNITION

description

Master's thesis in Information and Communication Technology, 2010, University of Agder, Grimstad. Traditionally, spam filtering systems have been built on content-based analysis techniques developed from experience with e-mail spam. More recently, social media platforms have emerged as a new style of information on the Internet, which also gives Internet abusers more room to operate. In this thesis, we evaluate traditional content-based approaches to classifying spam messages and also investigate the possibility of integrating context-based techniques with content-based approaches. We built spam classifiers that combine a novelty detection approach with Naïve Bayes, k-nearest neighbour, and a self-organizing map, respectively, and tested each of them on a large amount of experimental data. We also went a step beyond previous research by integrating a self-organizing map with Naïve Bayes to perform spam classification. The results show that judiciously combining context-based approaches with a content-based spam classifier can improve the classifier's performance in several respects. In addition, the results from the self-organizing map classifier with Naïve Bayes suggest that data clustering methods are promising for spam filtering. We therefore believe that this thesis offers new insight into natural language processing, and that the methods and techniques it proposes give researchers in spam filtering a good tool for analyzing context-based spam messages.
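
The abstract does not spell out how the content-based and context-based signals are combined, so the following is only an illustrative sketch and not the thesis's method. It assumes a multinomial Naïve Bayes score for message content and a simple novelty score defined as one minus the highest Jaccard word overlap with earlier messages in the same discussion thread. The names `NaiveBayes`, `novelty`, `is_spam`, the `weight` parameter, and the additive combination rule are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_odds(self, text):
        """Return log P(spam|text) - log P(ham|text); positive favours spam."""
        total_docs = sum(self.doc_counts.values())
        score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            s = math.log(self.doc_counts[label] / total_docs)
            for tok in tokenize(text):
                if tok in self.vocab:  # skip words never seen in training
                    s += math.log((counts[tok] + 1) / denom)
            score[label] = s
        return score["spam"] - score["ham"]


def novelty(message, thread):
    """Context score: 1 minus the best Jaccard overlap with earlier messages."""
    words = set(tokenize(message))
    best = 0.0
    for earlier in thread:
        other = set(tokenize(earlier))
        union = words | other
        if union:
            best = max(best, len(words & other) / len(union))
    return 1.0 - best


def is_spam(model, message, thread, weight=2.0):
    """Hypothetical rule: off-topic (novel) messages get pushed toward spam."""
    return model.log_odds(message) + weight * (novelty(message, thread) - 0.5) > 0
```

In this sketch, a message that shares no words with the preceding discussion has novelty 1.0 and its content score is shifted toward spam, while an on-topic reply is shifted toward ham. The additive weighting is a design choice made for illustration only.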

http://hdl.handle.net/11250/137488