
AUTHOR

Thomas Gottron

Showing 9 related works from this author.

Readability and the Web

2012

Readability indices measure how easy or difficult it is to read and comprehend a text. In this paper we look at the relation between readability indices and web documents from two different perspectives. On the one hand, we analyse how to reliably measure the readability of web documents by applying content extraction techniques and incorporating a bias correction. On the other hand, we investigate how web-based corpus statistics can be used to measure readability in a novel and language-independent way.
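
The abstract does not name a particular index; as a hedged illustration of the kind of surface-level measure involved, the sketch below computes the classic Flesch Reading Ease score from word, sentence and syllable counts. The vowel-group syllable counter is a crude stand-in, not the paper's method.

```python
import re

def count_syllables(word):
    # Rough vowel-group heuristic; real readability tools use dictionaries or better rules.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

print(flesch_reading_ease("Readability indices measure how easy it is to read a text."))
```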

Keywords: web document readability, content extraction, corpus statistics, readability, information retrieval, bias correction
Published in: Future Internet

A Comparison of Language Identification Approaches on Short, Query-Style Texts

2010

In a multi-language Information Retrieval setting, knowledge of the language of a user query is important for further processing. Hence, we compare the performance of several typical approaches for language detection on very short, query-style texts. The results show that an accuracy of more than 80% can be achieved already for single words; for slightly longer texts we even observed accuracy values close to 100%.
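
The abstract compares several detection approaches without singling one out; a minimal character-trigram sketch, trained on invented toy strings rather than the paper's corpora, illustrates how even a single query word can be assigned a language.

```python
from collections import Counter

def char_ngrams(text, n=3):
    text = f"  {text.lower()}  "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples):
    """samples: language -> list of example strings (toy data, not the paper's corpora)."""
    return {lang: Counter(g for s in texts for g in char_ngrams(s))
            for lang, texts in samples.items()}

def identify(query, profiles):
    # Score each language by how often its profile has seen the query's trigrams.
    grams = char_ngrams(query)
    scores = {lang: sum(profile[g] for g in grams) for lang, profile in profiles.items()}
    return max(scores, key=scores.get)

profiles = train({
    "en": ["the quick brown fox", "information retrieval"],
    "de": ["der schnelle braune fuchs", "informationsrückgewinnung"],
})
print(identify("retrieval", profiles))   # -> 'en'
```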

Keywords: information retrieval, language identification, natural language processing, artificial intelligence

Content Code Blurring: A New Approach to Content Extraction

2008

Most HTML documents on the World Wide Web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of such additional contents. Content extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel content extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing content extraction solutions, we show that for most documents content code blurrin…
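
On a hedged reading of the abstract, the idea can be sketched as follows: mark each token 1 for text and 0 for markup, smooth that signal repeatedly, and keep the region where the smoothed text density stays high. The tokenisation, blur radius and threshold below are illustrative choices, not the published algorithm.

```python
def blur(signal, passes=2, radius=2):
    """Repeatedly replace each value by the mean of its neighbourhood (simple box blur)."""
    for _ in range(passes):
        smoothed = []
        for i in range(len(signal)):
            window = signal[max(0, i - radius): i + radius + 1]
            smoothed.append(sum(window) / len(window))
        signal = smoothed
    return signal

def main_content(tokens, threshold=0.6):
    """tokens: (token, is_text) pairs; keep tokens where the blurred text density stays high."""
    density = blur([1.0 if is_text else 0.0 for _, is_text in tokens])
    return [tok for (tok, _), d in zip(tokens, density) if d >= threshold]

# Toy document: navigation markup around a run of article text.
doc = [("<a>", False), ("Home", False), ("</a>", False),
       ("The", True), ("main", True), ("article", True), ("text", True), ("here", True),
       ("<div>", False), ("Footer", False)]
print(main_content(doc))   # -> ['The', 'main', 'article', 'text', 'here']
```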

Keywords: information retrieval, content extraction, knowledge acquisition, content management
Published in: 2008 19th International Conference on Database and Expert Systems Applications

Efficient Graph Models for Retrieving Top-k News Feeds from Ego Networks

2012

A key challenge of web platforms like social networking sites and services for news feed aggregation is the efficient and targeted distribution of new content items to users. This can be formulated as the problem of retrieving the top-k news items out of the d-degree ego network of each given user, where the set of all users producing feeds is of size n, with n >> d >> k and typically k …
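
The abstract states the retrieval problem rather than the graph models evaluated in the paper; the sketch below only illustrates that problem, lazily merging the pre-sorted feeds of the d producers a user follows and stopping after k items. The feeds and scores are invented toy data.

```python
import heapq
from itertools import islice

# Toy feeds: producer -> items as (score, id), already sorted by score in descending order.
feeds = {
    "alice": [(9.1, "a3"), (7.4, "a2"), (2.0, "a1")],
    "bob":   [(8.5, "b2"), (3.3, "b1")],
    "carol": [(6.7, "c1")],
}

def top_k_news(ego_network, k):
    """k-way merge over the d feeds a user follows; only the first k items are materialised."""
    merged = heapq.merge(*(feeds[p] for p in ego_network), key=lambda x: x[0], reverse=True)
    return [item for _, item in islice(merged, k)]

print(top_k_news(["alice", "bob", "carol"], k=3))   # -> ['a3', 'b2', 'a2']
```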

Keywords: ego networks, information retrieval, graph databases, graph theory, social networks, scalability
Published in: 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing

Document Word Clouds: Visualising Web Documents as Tag Clouds to Aid Users in Relevance Decisions

2009

Information Retrieval systems spend great effort on determining the significant terms in a document. When, instead, a user is looking at a document, he cannot benefit from such information: he has to read the text to understand which words are important. In this paper we take a look at the idea of enhancing the perception of web documents with visualisation techniques borrowed from the tag clouds of Web 2.0. Highlighting the important words in a document by using a larger font size allows the reader to get a quick impression of the relevant concepts in a text. As this process does not depend on a user query, it can also be used for explorative search. A user study showed th…
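
The subject tags mention tf-idf; a minimal sketch, assuming tf-idf weighting and a linear mapping of weights to font sizes, shows how term importance could drive the size of words in such a document cloud. The corpus is a toy example, not the study's data.

```python
import math
from collections import Counter

def word_cloud_sizes(document, corpus, min_px=10, max_px=40):
    """Scale each term of `document` to a font size proportional to its tf-idf weight."""
    tf = Counter(document)
    df = Counter(term for doc in corpus for term in set(doc))
    n = len(corpus)
    weights = {t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf}
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1.0
    return {t: round(min_px + (w - lo) / span * (max_px - min_px)) for t, w in weights.items()}

corpus = [
    ["content", "extraction", "web", "document"],
    ["readability", "web", "document"],
    ["tag", "cloud", "web"],
]
print(word_cloud_sizes(["readability", "web", "document", "readability"], corpus))
# -> {'readability': 40, 'web': 10, 'document': 16}
```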

Keywords: information retrieval, document clustering, relevance, tag clouds, tf-idf, perception, World Wide Web

Combining content extraction heuristics

2008

The main text content of an HTML document on the WWW is typically surrounded by additional contents, such as navigation menus, advertisements, link lists or design elements. Content Extraction (CE) is the task of identifying and extracting the main content. Ongoing research has spawned several CE heuristics of different quality. However, so far only the Crunch framework combines several heuristics to improve its overall CE performance. Since Crunch, though, many new algorithms have been formulated. The CombinE system is designed to test, evaluate and optimise combinations of CE heuristics. Its aim is to develop CE systems which yield better and more reliable extracts of the main content of a web …
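
The abstract does not say how CombinE merges heuristics; a simple majority vote over token-level decisions is one plausible combination scheme and is sketched below with two invented stand-in heuristics.

```python
def combine_heuristics(tokens, heuristics, quorum=None):
    """Majority vote: keep a token if at least `quorum` heuristics marked it as main content.
    Each heuristic maps a token list to a set of kept indices."""
    votes = [h(tokens) for h in heuristics]
    quorum = quorum or (len(heuristics) // 2 + 1)
    return [t for i, t in enumerate(tokens)
            if sum(i in kept for kept in votes) >= quorum]

# Two toy stand-in heuristics; real CE heuristics would inspect markup, link density, etc.
def drop_markup(toks):
    return {i for i, t in enumerate(toks) if not t.startswith("<")}

def drop_short(toks):
    return {i for i, t in enumerate(toks) if len(t) > 3}

tokens = ["<a>", "Home", "</a>", "Article", "body", "text", "<div>"]
print(combine_heuristics(tokens, [drop_markup, drop_short]))
# -> ['Home', 'Article', 'body', 'text'] (both toy heuristics also keep "Home";
#    a combination can only be as good as its members)
```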

Keywords: information retrieval, content extraction, heuristics, data mining, web documents, Crunch
Published in: Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services

Estimating web site readability using content extraction

2009

Nowadays, information is primarily searched for on the WWW. From a user perspective, readability is an important criterion for measuring the accessibility, and thereby the quality, of information. We show that modern content extraction algorithms help to estimate the readability of a web document quite accurately.
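
A toy comparison makes the point concrete: a readability proxy computed on the full page text (navigation included) differs from the same proxy computed on the main content alone. The regex grab of the <p> element merely stands in for a real content extraction step.

```python
import re

def words_per_sentence(text):
    """Crude readability proxy: average sentence length in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    return len(words) / max(1, len(sentences))

page = "<nav>Home About Contact Login</nav><p>Short sentences read easily. Long ones do not.</p>"
full_text = re.sub(r"<[^>]+>", " ", page)               # naive: keep everything on the page
main_text = re.search(r"<p>(.*?)</p>", page).group(1)   # stand-in for a real CE algorithm

print(words_per_sentence(full_text), words_per_sentence(main_text))   # -> 6.0 4.0
```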

Keywords: information retrieval, content extraction, readability, usability, web sites
Published in: Proceedings of the 18th International Conference on World Wide Web

Alignment of Noisy and Uniformly Scaled Time Series

2009

The alignment of noisy and uniformly scaled time series is an important but difficult task. Given two time series, one of which is a uniformly stretched subsequence of the other, we want to determine the stretching factor and the offset of the second time series within the first one. We adapted and enhanced different methods to address this problem: classical FFT-based approaches to determine the offset, combined with a naive search for the stretching factor or its direct computation in the frequency domain; bounded dynamic time warping; and a new approach called shotgun analysis, which is inspired by the sequencing and reassembly of genomes in bioinformatics. We thoroughly examined the strengt…
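
Of the methods listed, the FFT-based offset search is the easiest to sketch: cross-correlating the two series in the frequency domain and taking the argmax recovers the offset despite noise. The stretching-factor search is omitted here, and the data are synthetic.

```python
import numpy as np

def best_offset(long_series, short_series):
    """Locate `short_series` inside `long_series` via FFT-based cross-correlation."""
    n = len(long_series)
    a = np.asarray(long_series, dtype=float)
    b = np.zeros(n)
    b[:len(short_series)] = short_series
    # Circular cross-correlation computed in the frequency domain.
    corr = np.fft.ifft(np.fft.fft(a) * np.conj(np.fft.fft(b))).real
    return int(np.argmax(corr))

rng = np.random.default_rng(0)
signal = rng.normal(size=500)
snippet = signal[120:180] + rng.normal(scale=0.1, size=60)   # noisy subsequence
print(best_offset(signal, snippet))   # -> 120 (true offset recovered despite the noise)
```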

Keywords: mathematical optimization, dynamic time warping, frequency domain, outliers, fast Fourier transform

Towards Bankruptcy Prediction: Deep Sentiment Mining to Detect Financial Distress from Business Management Reports

2018

Because their disclosure is required by law, business management reports have become publicly available for a large number of companies, and these reports offer the opportunity to assess the financial health or distress of a company, both quantitatively from the balance sheets and qualitatively from the text. In this paper, we analyze the potential of deep sentiment mining from the textual parts of business management reports and aim to detect signals of financial distress. We (1) created the largest corpus of business reports analyzed qualitatively to date, (2) defined a non-trivial target variable based on the so-called Altman Z-score, (3) developed a filtering of sentences based on class-co…
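
The truncated abstract says the target variable is derived from the Altman Z-score without giving the exact mapping; for reference, the sketch below computes the classic 1968 Z-score for public manufacturing firms and applies the conventional distress, grey-zone and safe cut-offs.

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Classic Altman (1968) Z-score for public manufacturing firms."""
    a = working_capital / total_assets
    b = retained_earnings / total_assets
    c = ebit / total_assets
    d = market_value_equity / total_liabilities
    e = sales / total_assets
    return 1.2 * a + 1.4 * b + 3.3 * c + 0.6 * d + 1.0 * e

# Invented balance-sheet figures for illustration only.
z = altman_z(working_capital=30, retained_earnings=50, ebit=20,
             market_value_equity=120, sales=200, total_assets=250, total_liabilities=100)
print(z, "distressed" if z < 1.81 else "grey zone" if z < 2.99 else "safe")
```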

Keywords: sentiment analysis, bankruptcy prediction, financial distress, balance sheets, data science, visualization
Published in: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)