Automatic Categorization of Web Sites
Lida Zhu
Subject: VDP::Mathematics and natural science: 400::Information and communication science: 420::Knowledge based systems: 425; IKT590
Master's thesis in information and communication technology, 2008, Universitetet i Agder, Grimstad
In this thesis we present a solution for classifying web sites with high accuracy into geographical attribute codes (NUTS) and economic activity attribute codes (NACE). We use keyword-based document classification methods, which have shown good performance. Each document is assigned a class label from a set of predefined categories, based on a pool of pre-classified sample documents. The solution has three steps. First, we remove stop words and skip HTML tags, which keeps the informative terms and discards non-informative or redundant ones to improve classification accuracy. Second, we use mutual information for feature selection to reduce the dimensionality of the feature space and produce vectors for classification. Finally, we perform the classification with the Naïve Bayes and Decision Tree algorithms and compare their performance. The system performed well in the experiments: it classifies web sites into NACE categories with a maximum accuracy of 97% on 46 web pages, while the best accuracy for NUTS classification is 93% on 223 web pages.
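The pipeline described in the abstract (tag stripping, stop-word removal, mutual-information feature selection, Naïve Bayes classification) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the stop-word list, the function names, the toy pages, and the labels `agriculture` and `it` are all hypothetical stand-ins (they are not real NACE codes). The Decision Tree classifier is not shown.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; the thesis list is not given here).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "we", "our", "on"}

def tokenize(html):
    """Strip HTML tags, lowercase the text, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def mutual_information(docs, labels):
    """Score each term by its mutual information with the class label,
    treating presence/absence of the term in a document as a binary variable."""
    n = len(docs)
    class_counts = Counter(labels)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        score = 0.0
        for present in (True, False):
            n_t = sum(1 for s in doc_sets if (term in s) == present)
            for c, n_c in class_counts.items():
                n_tc = sum(1 for s, y in zip(doc_sets, labels)
                           if (term in s) == present and y == c)
                if n_tc:  # zero joint counts contribute nothing
                    score += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
        scores[term] = score
    return scores

def train_naive_bayes(docs, labels, features):
    """Multinomial Naive Bayes with add-one smoothing over the selected features."""
    priors = {c: math.log(k / len(labels)) for c, k in Counter(labels).items()}
    counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        counts[y].update(t for t in doc if t in features)
    likelihoods = {}
    for c in priors:
        total = sum(counts[c].values()) + len(features)
        likelihoods[c] = {t: math.log((counts[c][t] + 1) / total) for t in features}
    return priors, likelihoods

def classify(doc, priors, likelihoods):
    """Return the class with the highest log-posterior for a tokenized document."""
    def score(c):
        return priors[c] + sum(likelihoods[c][t] for t in doc if t in likelihoods[c])
    return max(priors, key=score)

# Toy pre-classified pages (hypothetical data and labels, for illustration only).
pages = [
    ("<html><body><h1>Dairy farm</h1><p>Fresh milk and cattle feed</p></body></html>", "agriculture"),
    ("<p>Organic farm with cattle and crops</p>", "agriculture"),
    ("<div>Software consulting and web development</div>", "it"),
    ("<p>Custom software and cloud development services</p>", "it"),
]
docs = [tokenize(html) for html, _ in pages]
labels = [label for _, label in pages]

mi = mutual_information(docs, labels)
features = set(sorted(mi, key=mi.get, reverse=True)[:4])  # keep the 4 top-scoring terms
priors, likelihoods = train_naive_bayes(docs, labels, features)
predicted = classify(tokenize("<p>Software development for the web</p>"), priors, likelihoods)
```

In this toy setup the selected features are the terms that occur in every page of one class and in none of the other, which is the behaviour mutual information is meant to reward. A real corpus would need a much larger stop-word list and feature budget.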
| year | journal | country | edition | language |
|---|---|---|---|---|
| 2008-01-01 | | | | |