Search results for " mining"
showing 10 items of 1548 documents
Estimation of National Colorectal-Cancer Incidence Using Claims Databases
2012
Background. The aim of the study was to assess the accuracy of the colorectal-cancer incidence estimated from administrative data. Methods. We selected potential incident colorectal-cancer cases in 2004-2005 French administrative data, using two alternative algorithms. The first was based only on diagnostic and procedure codes, whereas the second considered the past history of the patient. Results of both methods were assessed against two corresponding local cancer registries, acting as “gold standards.” We then constructed a multivariable regression model to estimate the corrected total number of incident colorectal-cancer cases from the whole national administrative database. Results. The firs…
Dealing with spatial data pooled over time in statistical models
2012
Recent developments in spatial econometrics have been devoted to spatio-temporal data and how spatial panel data structure should be modeled. Little effort has been devoted to the way one must deal with spatial data pooled over time. This paper presents the characteristics of spatial data pooled over time and proposes a simple way to take into account unidirectional temporal effect as well as multidirectional spatial effect in the estimation process. An empirical example, using data on 25,357 single family homes sold in Lucas County, OH (USA), between 1993 and 1998 (available in the MatLab library), is used to illustrate the potential of the approach proposed.
Missing Value Estimation for Microarray Data by Bayesian Principal Component Analysis and Iterative Local Least Squares
2013
Published version of an article from the journal: Mathematical Problems in Engineering. Also available from Hindawi: http://dx.doi.org/10.1155/2013/162938 Missing values are prevalent in microarray data; they negatively influence downstream microarray analyses and thus should be estimated from known values. We propose a BPCA-iLLS method, which integrates two commonly used missing value estimation methods: Bayesian principal component analysis (BPCA) and local least squares (LLS). The inferior row-average procedure in LLS is replaced with BPCA, and the least squares method is put into an iterative framework. Comparative results show that the proposed method has obtaine…
smatr 3 - an R package for estimation and inference about allometric lines
2011
Summary 1. The Standardised Major Axis Tests and Routines (SMATR) software provides tools for estimation and inference about allometric lines, currently widely used in ecology and evolution. 2. This paper describes some significant improvements to the functionality of the package, now available on R in smatr version 3. 3. New inclusions in the package include sma and ma functions that accept formula input and perform the key inference tasks; multiple comparisons; graphical methods for visualising data and checking (S)MA assumptions; robust (S)MA estimation and inference tools.
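As a rough illustration of what a standardised major axis (SMA) line is, the sketch below fits one in plain Python using the textbook definition: the slope is the ratio of the standard deviations of y and x, signed by their covariance, and the line passes through the means. The function name `sma_fit` is hypothetical; this is not the smatr API and it omits the inference, multiple-comparison, and robust tools the abstract describes.

```python
import math
import statistics


def sma_fit(x, y):
    """Return (slope, intercept) of a standardised major axis line.

    Hypothetical helper for illustration only; not part of smatr.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sd_x = statistics.stdev(x)
    sd_y = statistics.stdev(y)
    if sd_x == 0:
        raise ValueError("x has zero variance; the slope is undefined")
    # The sign of the slope follows the sign of the sample covariance.
    # Assumption: a covariance of exactly zero is treated as positive.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = math.copysign(sd_y / sd_x, cov)
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    slope, intercept = sma_fit([1, 2, 3, 4], [2, 4, 6, 8])
    print(slope, intercept)
```

Unlike ordinary least squares, this slope treats x and y symmetrically, which is why SMA is often preferred when both variables carry measurement error.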
Hunting for valuables from landfills and assessing their market opportunities: A case study with Kudjape landfill in Estonia
2017
Landfill mining is an alternative technology that merges the ideas of material recycling and sustainable waste management. This paper reports a case study to estimate the value of landfilled materials and their respective market opportunities, based on a full-scale landfill mining project in Estonia. During the project, a dump site (Kudjape, Estonia) was excavated with the main objectives of extracting soil-like final cover material with the function of methane degradation. In total, about 57,777 m³ of waste was processed, particularly the uppermost 10-year layer of waste. Manual sorting was performed in four test pits to determine the detailed composition of wastes. 11,610 kg of waste was…
The upgraded HADES trigger and data acquisition system
2011
The HADES experiment is a High Acceptance Di-Electron Spectrometer located at GSI in Darmstadt, Germany. Recently, its trigger and data acquisition system was upgraded. The main goal was to substantially increase the event rate capability by a factor of up to 20 to reach 100 kHz in light and 20 kHz in heavy ion reaction systems. The total data rate written to storage is about 400 MByte/s at peak. In this context, the complete read-out system was exchanged to FPGA-based platforms using optical communication. For data transport a general-purpose real-time network protocol was developed to meet the strict requirements of the system. In particular, trigger information has to reach all front-end …
CUDA-Accelerated Alignment of Subsequences in Streamed Time Series Data
2014
Euclidean Distance (ED) and Dynamic Time Warping (DTW) are cornerstones in the field of time series data mining. Many high-level algorithms like kNN-classification, clustering or anomaly detection make extensive use of these distance measures as subroutines. Furthermore, the vast growth of recorded data produced by automated monitoring systems or integrated sensors establishes the need for efficient implementations. In this paper, we introduce linear memory parallelization schemes for the alignment of a given query Q in a stream of time series data S for both ED and DTW using CUDA-enabled accelerators. The ED parallelization features a log-linear calculation scheme in contrast to the naive …
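For readers unfamiliar with the two distance measures named in this abstract, here is a minimal sequential sketch in Python. It is not the paper's CUDA scheme: it only shows the standard ED formula and the classic DTW recurrence, computed with two rolling rows so that memory grows linearly with the length of the second series. The function names and the squared-difference local cost are assumptions made for illustration.

```python
import math


def euclidean_distance(a, b):
    """Lock-step Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("ED requires series of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Unconstrained DTW with squared-difference local cost.

    Keeps only the previous and current rows of the cost matrix,
    so memory use is O(len(b)) rather than O(len(a) * len(b)).
    """
    m = len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, len(a) + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of: vertical, horizontal, diagonal step.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])


if __name__ == "__main__":
    query = [0.0, 1.0, 2.0]
    stretched = [0.0, 1.0, 1.0, 2.0]
    # DTW can align series of different lengths; ED cannot.
    print(dtw_distance(query, stretched))
```

The quadratic running time of this recurrence is the cost that makes accelerated implementations attractive when it is called repeatedly as a subroutine.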
Criminal networks analysis in missing data scenarios through graph distances
2021
Data collected in criminal investigations may suffer from issues such as (i) incompleteness, due to the covert nature of criminal organizations; (ii) incorrectness, caused by either unintentional data collection errors or intentional deception by criminals; (iii) inconsistency, when the same information is collected into law enforcement databases multiple times, or in different formats. In this paper we analyze nine real criminal networks of different nature (i.e., Mafia networks, criminal street gangs and terrorist organizations) in order to quantify the impact of incomplete data, and to determine which network type is most affected by it. The networks are first pruned using two specific m…
GEM
2014
The widespread use of digital sensor systems causes a tremendous demand for high-quality time series analysis tools. In this domain the majority of data mining algorithms relies on established distance measures like Dynamic Time Warping (DTW) or Euclidean distance (ED). However, the notion of similarity induced by ED and DTW may lead to unsatisfactory clusterings. In order to address this shortcoming we introduce the Gliding Elastic Match (GEM) algorithm. It determines an optimal local similarity measure of a query time series Q and a subject time series S. The measure is invariant under both local deformation on the measurement-axis and scaling in the time domain. GEM is compared to ED and…
Data Mining Algorithms for Knowledge Extraction
2020
In this paper, we study the methods, techniques, and algorithms used in data mining. Among the algorithms studied, we emphasize clustering algorithms, and more precisely the K-means algorithm. This algorithm was first studied using the Euclidean distance, and then with the distance between the clusters modified to use the Mahalanobis and Canberra distances. After implementing the algorithms in C/C++, we compared the clustering of the three algorithms, after which we modified them and studied the distance between the clusters.
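To make the three distances in this abstract concrete, the following Python sketch defines each one and plugs it into a K-means-style assignment step. The paper's implementation is in C/C++ and is not reproduced here; the function names, the way the inverse covariance matrix is supplied, and the choice to swap the metric in the assignment step are all assumptions made for illustration.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|), skipping terms where both are zero."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def mahalanobis(x, y, inv_cov):
    """sqrt(d^T * inv_cov * d), where d = x - y.

    inv_cov is the inverse covariance matrix as a list of rows;
    the caller is assumed to have computed it beforehand.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)


def assign(points, centroids, dist):
    """K-means assignment step: index of the nearest centroid per point."""
    return [
        min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        for p in points
    ]


if __name__ == "__main__":
    points = [(0.0, 0.0), (10.0, 10.0), (1.0, 1.0)]
    centroids = [(0.0, 0.0), (9.0, 9.0)]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(assign(points, centroids, euclidean))
    print(assign(points, centroids, canberra))
    print(assign(points, centroids, lambda p, c: mahalanobis(p, c, identity)))
```

Because the Canberra distance normalises each coordinate difference by the coordinate magnitudes, it can assign points near the origin differently from the Euclidean distance, which is one reason the choice of metric changes the resulting clusters.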