Jouni Helske
Comparison of Attention Behaviour Across User Sets through Automatic Identification of Common Areas of Interest
Eye tracking is used to analyze and compare user behaviour within numerous domains, but long duration eye tracking experiments across multiple users generate millions of eye gaze samples, making th ...
Can visualization alleviate dichotomous thinking? Effects of visual representations on the cliff effect
Common reporting styles for statistical results in scientific articles, such as $p$-values and confidence intervals (CI), have been reported to be prone to dichotomous interpretations, especially with respect to the null hypothesis significance testing framework. For example, when the $p$-value is small enough or the CIs of the mean effects of a studied drug and a placebo are not overlapping, scientists tend to claim significant differences while often disregarding the magnitudes and absolute differences in the effect sizes. This type of reasoning has been shown to be potentially harmful to science. Techniques relying on the visual estimation of the strength of evidence have been recom…
Estimating the causal effect of timing on the reach of social media posts
Modern companies regularly use social media to communicate with their customers. In addition to the content, the reach of a social media post may depend on the season, the day of the week, and the time of the day. We consider optimizing the timing of Facebook posts by a large Finnish consumers’ cooperative using historical data on previous posts and their reach. The content and the timing of the posts reflect the marketing strategy of the cooperative. These choices affect the reach of a post via a dynamic process where the reactions of users make the post more visible to others. We describe the causal relations of the social media publishing in the form of a directed acyclic graph, …
Estimating aggregated nutrient fluxes in four Finnish rivers via Gaussian state space models
Reliable estimates of the nutrient fluxes carried by rivers from land-based sources to the sea are needed for efficient abatement of marine eutrophication. Although nutrient concentrations in rivers generally display large temporal variation, sampling and analysis for nutrients, unlike flow measurements, are rarely performed on a daily basis. The infrequent data calls for ways to reliably estimate the nutrient concentrations of the missing days. Here, we use the Gaussian state space models with daily water flow as a predictor variable to predict missing nutrient concentrations for four agriculturally impacted Finnish rivers. Via simulation of Gaussian state space models, we are able to esti…
Graphical model inference: Sequential Monte Carlo meets deterministic approximations
Approximate inference in probabilistic graphical models (PGMs) can be grouped into deterministic methods and Monte-Carlo-based methods. The former can often provide accurate and rapid inferences, but are typically associated with biases that are hard to quantify. The latter enjoy asymptotic consistency, but can suffer from high computational costs. In this paper we present a way of bridging the gap between deterministic and stochastic inference. Specifically, we suggest an efficient sequential Monte Carlo (SMC) algorithm for PGMs which can leverage the output from deterministic inference methods. While generally applicable, we show explicitly how this can be done with loopy belief propagati…
Analysing Complex Life Sequence Data with Hidden Markov Modelling
When analysing complex sequence data with multiple channels (dimensions) and long observation sequences, describing and visualizing the data can be a challenge. Hidden Markov models (HMMs) and their mixtures (MHMMs) offer a probabilistic model-based framework where the information in such data can be compressed into hidden states (general life stages) and clusters (general patterns in life courses). We studied two different approaches to analysing clustered life sequence data with sequence analysis (SA) and hidden Markov modelling. In the first approach we used SA clusters as fixed and estimated HMMs separately for each group. In the second approach we treated SA clusters as suggestive and …
Estimation of causal effects with small data in the presence of trapdoor variables
We consider the problem of estimating causal effects of interventions from observational data when well-known back-door and front-door adjustments are not applicable. We show that when an identifiable causal effect is subject to an implicit functional constraint that is not deducible from conditional independence relations, the estimator of the causal effect can exhibit bias in small samples. This bias is related to variables that we call trapdoor variables. We use simulated data to study different strategies to account for trapdoor variables and suggest how the related trapdoor bias might be minimized. The importance of trapdoor variables in causal effect estimation is illustrated with rea…
Prediction and interpolation of time series by state space models
Article dissertation. Contains an introduction part and four articles. A large amount of data collected today is in the form of a time series. In order to make realistic inferences based on time series forecasts, in addition to point predictions, prediction intervals or other measures of uncertainty should be presented. Multiple sources of uncertainty are often ignored due to the complexities involved in accounting for them correctly. In this dissertation, some of these problems are reviewed and some new solutions are presented. A state space approach is also advocated for an efficient and flexible framework for time series forecas…
Introducing libeemd: a program package for performing the ensemble empirical mode decomposition
The ensemble empirical mode decomposition (EEMD) and its complete variant (CEEMDAN) are adaptive, noise-assisted data analysis methods that improve on the ordinary empirical mode decomposition (EMD). All these methods decompose possibly nonlinear and/or nonstationary time series data into a finite amount of components separated by instantaneous frequencies. This decomposition provides a powerful method to look into the different processes behind a given time series data, and provides a way to separate short time-scale events from a general trend. We present a free software implementation of EMD, EEMD and CEEMDAN and give an overview of the EMD methodology and the algorithms used in the deco…
Efficient Bayesian generalized linear models with time-varying coefficients: The walker package in R
The R package walker extends standard Bayesian general linear models to the case where the effects of the explanatory variables can vary in time. This makes it possible, for example, to model the effects of interventions, such as changes in tax policy, whose effect increases gradually over time. The Markov chain Monte Carlo algorithms powering the Bayesian inference are based on Hamiltonian Monte Carlo provided by Stan software, using a state space representation of the model to marginalise over the regression coefficients for efficient low-dimensional sampling.
A Bayesian spatio‐temporal analysis of markets during the Finnish 1860s famine
We develop a Bayesian spatio-temporal model to study pre-industrial grain market integration during the Finnish famine of the 1860s. Our model takes into account several problematic features often present when analysing multiple spatially interdependent time series. For example, compared with the error correction methodology commonly applied in econometrics, our approach allows simultaneous modelling of multiple interdependent time series avoiding cumbersome statistical testing needed to predetermine the market leader as a point of reference. Furthermore, introducing a flexible spatio-temporal structure enables analysing detailed regional and temporal dynamics of the market mechanisms. Appl…
A nonlinear mixed model approach to predict energy expenditure from heart rate.
Objective. Heart rate (HR) monitoring provides a convenient and inexpensive way to predict energy expenditure (EE) during physical activity. However, there is a lot of variation among individuals in the EE-HR relationship, which should be taken into account in predictions. The objective is to develop a model that allows the prediction of EE based on HR as accurately as possible and allows an improvement of the prediction using calibration measurements from the target individual. Approach. We propose a nonlinear (logistic) mixed model for EE and HR measurements and an approach to calibrate the model for a new person who does not belong to the dataset used to estimate the model. The …
Combining Sequence Analysis and Hidden Markov Models in the Analysis of Complex Life Sequence Data
Life course data often consists of multiple parallel sequences, one for each life domain of interest. Multichannel sequence analysis has been used for computing pairwise dissimilarities and finding clusters in this type of multichannel (or multidimensional) sequence data. Describing and visualizing such data is, however, often challenging. We propose an approach for compressing, interpreting, and visualizing the information within multichannel sequences by finding (1) groups of similar trajectories and (2) similar phases within trajectories belonging to the same group. For these tasks we combine multichannel sequence analysis and hidden Markov modelling. We illustrate this approach with an …
Importance sampling type estimators based on approximate marginal Markov chain Monte Carlo
We consider importance sampling (IS) type weighted estimators based on Markov chain Monte Carlo (MCMC) targeting an approximate marginal of the target distribution. In the context of Bayesian latent variable models, the MCMC typically operates on the hyperparameters, and the subsequent weighting may be based on IS or sequential Monte Carlo (SMC), but allows for multilevel techniques as well. The IS approach provides a natural alternative to delayed acceptance (DA) pseudo-marginal/particle MCMC, and has many advantages over DA, including a straightforward parallelisation and additional flexibility in MCMC implementation. We detail minimal conditions which ensure strong consistency of the sug…
bssm: Bayesian Inference of Non-linear and Non-Gaussian State Space Models in R
We present an R package bssm for Bayesian non-linear/non-Gaussian state space modelling. Unlike the existing packages, bssm allows for easy-to-use approximate inference based on Gaussian approximations such as the Laplace approximation and the extended Kalman filter. The package also accommodates discretely observed latent diffusion processes. The inference is based on fully automatic, adaptive Markov chain Monte Carlo (MCMC) on the hyperparameters, with optional importance sampling post-correction to eliminate any approximation bias. The package also implements a direct pseudo-marginal MCMC and a delayed acceptance pseudo-marginal MCMC using intermediate approximations. The package offers …
Improved Frequentist Prediction Intervals for Autoregressive Models by Simulation
It is well known that the so-called plug-in prediction intervals for autoregressive processes with Gaussian disturbances are too narrow, i.e. the coverage probabilities fall below the nominal ones. However, simulation experiments show that the formulas borrowed from the ordinary linear regression theory yield one-step prediction intervals that have coverage probabilities very close to what is claimed. From a Bayesian point of view, the resulting intervals are posterior predictive intervals when uniform priors are assumed for both the autoregressive coefficients and the logarithm of the disturbance variance. This finding opens a path to treating multi-step prediction intervals, which are obtain…
Mixture Hidden Markov Models for Sequence Data: The seqHMM Package in R
Sequence analysis is being more and more widely used for the analysis of social sequences and other multivariate categorical time series data. However, it is often complex to describe, visualize, and compare large sequence data, especially when there are multiple parallel sequences per subject. Hidden (latent) Markov models (HMMs) are able to detect underlying latent structures and they can be used in various longitudinal settings: to account for measurement error, to detect unobservable states, or to compress information across several types of observations. Extending to mixture hidden Markov models (MHMMs) allows clustering data into homogeneous subsets, with or without external covariate…
KFAS: Exponential Family State Space Models in R
State space modelling is an efficient and flexible method for statistical inference of a broad class of time series and other data. This paper describes an R package KFAS for state space modelling with the observations from an exponential family, namely Gaussian, Poisson, binomial, negative binomial and gamma distributions. After introducing the basic theory behind Gaussian and non-Gaussian state space models, an illustrative example of Poisson time series forecasting is provided. Finally, a comparison to alternative R packages suitable for non-Gaussian time series modelling is presented.
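To make the Gaussian case mentioned above concrete, the following is a minimal sketch of a Kalman filter for the simplest Gaussian state space model, the local level model y_t = mu_t + eps_t, mu_{t+1} = mu_t + eta_t. It is written in plain Python for illustration and is not the KFAS implementation or its API; the function name, argument names, and the large initial variance used to mimic a diffuse prior are all assumptions of this sketch. The non-Gaussian (Poisson, binomial, etc.) cases described in the abstract are not covered.

```python
import math


def local_level_filter(y, sigma2_eps, sigma2_eta, a1=0.0, p1=1e7):
    """Kalman filter for a local level model (illustrative sketch).

    y          : list of observations; None marks a missing value
    sigma2_eps : observation noise variance
    sigma2_eta : level (state) noise variance
    a1, p1     : prior mean and variance of the first state; the large
                 default p1 only approximates a diffuse prior (assumption)

    Returns a list of (filtered mean, filtered variance) pairs and the
    Gaussian log-likelihood accumulated over the observed values.
    """
    a, p = a1, p1  # one-step-ahead predicted mean and variance
    filtered = []
    loglik = 0.0
    for obs in y:
        if obs is None:
            # Missing observation: no update, the prediction is carried over.
            a_filt, p_filt = a, p
        else:
            v = obs - a              # prediction error
            f = p + sigma2_eps       # prediction error variance
            k = p / f                # Kalman gain
            a_filt = a + k * v
            p_filt = p * sigma2_eps / f
            loglik += -0.5 * (math.log(2.0 * math.pi * f) + v * v / f)
        filtered.append((a_filt, p_filt))
        # Prediction step: a random walk keeps the mean and adds state noise.
        a, p = a_filt, p_filt + sigma2_eta
    return filtered, loglik
```

For a series with gaps, such as `local_level_filter([1.0, None, 3.0], 1.0, 0.5)`, the filtered mean is held fixed over the missing value while its variance grows, which is the mechanism that lets state space models interpolate missing observations with honest uncertainty.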
Improved frequentist prediction intervals for ARMA models by simulation
[Introduction] In a traditional approach to time series forecasting, prediction intervals are usually computed as if the chosen model were correct and the parameters of the model completely known, with no reference to the uncertainty regarding the model selection and parameter estimation. The parameter uncertainty may not be a major source of prediction errors in practical applications, but its effects can be substantial if the series is not too long. The problems of interval prediction are discussed in depth in Chatfield (1993, 1996) and Clements & Hendry (1999). [Continues; please see the article]
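As an illustration of the traditional approach described above, here is a minimal sketch of a plug-in prediction interval for a zero-mean AR(1) process y_t = phi * y_{t-1} + e_t with Gaussian disturbances. The estimated phi and sigma2 are treated as if they were the true values, which is exactly the simplification the text warns about. The function and argument names are hypothetical, and the simulation-based correction proposed in the article is not implemented here.

```python
import math


def ar1_plugin_interval(y_last, phi, sigma2, h, z=1.959964):
    """Plug-in h-step prediction interval for a zero-mean AR(1) process.

    y_last : last observed value
    phi    : (estimated) autoregressive coefficient, treated as known
    sigma2 : (estimated) disturbance variance, treated as known
    h      : forecast horizon (h >= 1)
    z      : standard normal quantile; the default gives a nominal 95% interval
    """
    # Point forecast: E[y_{T+h} | y_T] = phi^h * y_T
    mean = phi ** h * y_last
    # Forecast error variance: sigma2 * sum_{j=0}^{h-1} phi^(2j)
    var = sigma2 * sum(phi ** (2 * j) for j in range(h))
    half_width = z * math.sqrt(var)
    return mean - half_width, mean + half_width
```

Because the interval ignores the sampling variability of phi and sigma2, its actual coverage tends to fall below the nominal level when the series is short.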
dynamite: An R Package for Dynamic Multivariate Panel Models
dynamite is an R package for Bayesian inference of intensive panel (time series) data comprising multiple measurements per multiple individuals measured in time. The package supports joint modeling of multiple response variables, time-varying and time-invariant effects, a wide range of discrete and continuous distributions, group-specific random effects, latent factors, and customization of prior distributions of the model parameters. Models in the package are defined via a user-friendly formula interface, and estimation of the posterior distribution of the model parameters takes advantage of state-of-the-art Markov chain Monte Carlo methods. The package enables efficient computation of …
A Bayesian Reconstruction of a Historical Population in Finland, 1647–1850
This article provides a novel method for estimating historical population development. We review the previous literature on historical population time-series estimates and propose a general outline to address the well-known methodological problems. We use a Bayesian hierarchical time-series model that allows us to integrate the parish-level data set and prior population information in a coherent manner. The procedure provides us with model-based posterior intervals for the final population estimates. We demonstrate its applicability by estimating the long-term development of Finland's population from 1647 onward and simultaneously place the country among the very few to have an annual popula…
From Sequences to Variables: Rethinking the Relationship between Sequences and Outcomes
Sequence analysis is increasingly used in the social sciences for the holistic analysis of life-course and other longitudinal data. The usual approach is to construct sequences, calculate dissimilarities, group similar sequences with cluster analysis, and use cluster membership as a dependent or independent variable in a regression model. This approach may be problematic, as cluster memberships are assumed to be fixed known characteristics of the subjects in subsequent analyses. Furthermore, it is often more reasonable to assume that individual sequences are mixtures of multiple ideal types rather than equal members of some group. Failing to account for uncertain and mixed memberships may l…