0000000000234233

AUTHOR

Thomas Gottron

Readability and the Web

Readability indices measure how easy or difficult it is to read and comprehend a text. In this paper we look at the relation between readability indices and web documents from two different perspectives. On the one hand we analyse how to reliably measure the readability of web documents by applying content extraction techniques and incorporating a bias correction. On the other hand we investigate how web based corpus statistics can be used to measure readability in a novel and language independent way.

research product

A Comparison of Language Identification Approaches on Short, Query-Style Texts

In a multi-language Information Retrieval setting, the knowledge about the language of a user query is important for further processing. Hence, we compare the performance of some typical approaches for language detection on very short, query-style texts. The results show that already for single words an accuracy of more than 80% can be achieved, for slightly longer texts we even observed accuracy values close to 100%.

research product

Content Code Blurring: A New Approach to Content Extraction

Most HTML documents on the world wide web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of additional contents. Content extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel content extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing content extraction solutions we show thatfor most documents content code blurrin…

research product

Efficient Graph Models for Retrieving Top-k News Feeds from Ego Networks

A key challenge of web platforms like social networking sites and services for news feed aggregation is the efficient and targeted distribution of new content items to users. This can be formulated as the problem of retrieving the top-k news items out of the d-degree ego network of each given user, where the set of all users producing feeds is of size n, with n >> d >> k and typically k

research product

Document Word Clouds: Visualising Web Documents as Tag Clouds to Aid Users in Relevance Decisions

Περιέχει το πλήρες κείμενο Information Retrieval systems spend a great effort on determining the significant terms in a document. When, instead, a user is looking at a document he cannot benefit from such information. He has to read the text to understand which words are important. In this paper we take a look at the idea of enhancing the perception of web documents with visualisation techniques borrowed from the tag clouds of Web 2.0. Highlighting the important words in a document by using a larger font size allows to get a quick impression of the relevant concepts in a text. As this process does not depend on a user query it can also be used for explorative search. A user study showed, th…

research product

Combining content extraction heuristics

The main text content of an HTML document on the WWW is typically surrounded by additional contents, such as navigation menus, advertisements, link lists or design elements. Content Extraction (CE) is the task to identify and extract the main content. Ongoing research has spawned several CE heuristics of different quality. However, so far only the Crunch framework combines several heuristics to improve its overall CE performance. Since Crunch, though, many new algorithms have been formulated. The CombinE system is designed to test, evaluate and optimise combinations of CE heuristics. Its aim is to develop CE systems which yield better and more reliable extracts of the main content of a web …

research product

Estimating web site readability using content extraction

Nowadays, information is primarily searched on the WWW. From a user perspective, the readability is an important criterion for measuring the accessibility and thereby the quality of an information. We show that modern content extraction algorithms help to estimate the readability of a web document quite accurate.

research product

Alignment of Noisy and Uniformly Scaled Time Series

The alignment of noisy and uniformly scaled time series is an important but difficult task. Given two time series, one of which is a uniformly stretched subsequence of the other, we want to determine the stretching factor and the offset of the second time series within the first one. We adapted and enhanced different methods to address this problem: classical FFT-based approaches to determine the offset combined with a naive search for the stretching factor or its direct computation in the frequency domain, bounded dynamic time warping and a new approach called shotgun analysis, which is inspired by sequencing and reassembling of genomes in bioinformatics. We thoroughly examined the strengt…

research product

Towards Bankruptcy Prediction: Deep Sentiment Mining to Detect Financial Distress from Business Management Reports

Due to their disclosure required by law, business management reports have become publicly available for a large number of companies, and these reports offer the opportunity to assess the financial health or distress of a company, both quantitatively from the balance sheets and qualitatively from the text. In this paper, we analyze the potential of deep sentiment mining from the textual parts of business management reports and aim to detect signals for financial distress. We (1) created the largest corpus of business reports analyzed qualitatively to date, (2) defined a non-trivial target variable based on the so-called Altman Z-score, (3) developed a filtering of sentences based on class-co…

research product