I've started thinking a little more about what I'm actually doing when I use something like posteriordb (from Stan) or Inference Gym (from TensorFlow Probability) to evaluate a sampler.

posteriordb provides 10,000 reference draws for each of its Stan programs and posteriors. Reference draws have to be produced by some sampler. In some cases, such as Neal's funnel or high-dimensional normals, you can just run forward simulation to get independent draws. But for models with data, you don't actually know the posterior; that's why we're running MCMC in the first place. I believe what posteriordb did was run NUTS for enough iterations that the chains could be thinned down to roughly 10,000 effectively independent draws.

**Standard error of the reference draws**

The standard error of an estimate of a posterior quantity, such as a parameter's posterior mean, based on N independent draws is

se = sd / sqrt(N),

where sd is the posterior standard deviation of the quantity. Plain Monte Carlo scales horribly because of that square root. Markov chain Monte Carlo can be even worse: because of the correlation among draws, what matters is the effective sample size rather than the raw number of draws, and for harder posteriors the effective sample size can be much smaller than the number of iterations.
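To make the scaling concrete, here's a small sketch using synthetic draws (a standard normal target, and an AR(1) chain as a stand-in for correlated MCMC output; the correlation value 0.9 and the AR(1) effective-sample-size formula ESS ≈ N(1−ρ)/(1+ρ) are illustrative assumptions, not anything posteriordb does):

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent draws: standard error of the mean is sd / sqrt(N).
N = 10_000
draws = rng.normal(loc=0.0, scale=1.0, size=N)
se_independent = draws.std(ddof=1) / np.sqrt(N)   # about sd / 100

# Correlated AR(1) draws with the same stationary sd = 1.
# For AR(1), ESS ~= N * (1 - rho) / (1 + rho), so the standard
# error inflates by sqrt((1 + rho) / (1 - rho)).
rho = 0.9
ar1 = np.empty(N)
ar1[0] = rng.normal()
for t in range(1, N):
    ar1[t] = rho * ar1[t - 1] + np.sqrt(1 - rho**2) * rng.normal()
ess = N * (1 - rho) / (1 + rho)
se_mcmc = ar1.std(ddof=1) / np.sqrt(ess)          # roughly 4x larger here
```

With ρ = 0.9, the same 10,000 iterations buy you an effective sample size of only about 500, and the standard error is roughly four times that of independent draws.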

Because of the square root, with 10,000 draws the standard error is only 1/100 of the standard deviation. In general, it takes 4 times as many draws to cut the error in half, and 100 times as many draws to add a decimal place of precision. So you can't reasonably expect something like posteriordb to store enough draws to allow fine-grained convergence evaluation. A slightly more economical alternative is to store long-run averages. We could run 1 million iterations and save only the estimates of the parameters and of the squared parameters. Fire off 100 jobs on a cluster and you can get 100 million draws. But that's still only accurate to three or four decimal places. You might then be able to squeeze out another decimal place or two using control variates.
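The "store long-run averages" idea only needs a couple of accumulators per parameter, so the storage cost is constant no matter how long you run. A minimal sketch, assuming independent standard-normal draws as a stand-in for one parameter's posterior (the 100 jobs × 10,000 draws split is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Streaming accumulators: store only running sums, never the draws.
total, total_sq, n = 0.0, 0.0, 0

for job in range(100):                 # e.g., 100 cluster jobs
    draws = rng.normal(size=10_000)    # each job contributes 10,000 draws
    total += draws.sum()
    total_sq += (draws ** 2).sum()
    n += draws.size

mean = total / n
var = total_sq / n - mean ** 2         # E[X^2] - (E[X])^2
se_mean = np.sqrt(var / n)             # ~ sd / 1000 for n = 1e6
```

With a million total draws, the standard error of the mean is about sd/1000, i.e., roughly three decimal places of precision, which is the point: even heroic amounts of sampling buy precision slowly.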

That's a lot of residual error to deal with when evaluating algorithms that asymptote to the correct answer. The only thing that saves us is that the algorithm being evaluated is also a sampler, and is subject to the same inverse-square-root rate of reduction in standard error.

**Reporting estimation error**

What we really want to report is the standard error of the residual between the new algorithm's estimate and the truth. But we usually don't know the truth. In practice, people typically do one of two things, depending on the compute intensity, compute resources, and patience they have for fitting the model.

*Easy model:* Run the new algorithm 100 or 1000 times and report something like a histogram of errors. With this many runs, the standard error can be estimated directly (assuming the draws are unbiased), and if the true value is known, it can be estimated even more tightly without that assumption.
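A sketch of the easy-model protocol, using a toy "algorithm" (the mean of 100 independent standard-normal draws) so the truth is known exactly; the run counts and sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

true_mean = 0.0                        # known truth for this toy posterior
errors = []
for run in range(1000):                # re-run the "algorithm" many times
    draws = rng.normal(size=100)       # stand-in for one full fit
    errors.append(draws.mean() - true_mean)
errors = np.array(errors)

# With the truth known, no unbiasedness assumption is needed:
rmse = np.sqrt((errors ** 2).mean())   # ~ 1 / sqrt(100) = 0.1
bias = errors.mean()                   # should be near 0 here
# errors itself is what you'd histogram in a report
```

The empirical RMSE here recovers the theoretical sd/√N ≈ 0.1, and the error array is exactly what you'd plot as the histogram.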

*Hard model:* Run the new algorithm once or twice, and report the new algorithm's own estimated standard error in place of an empirical histogram of errors. This is dicey, because the error in estimating the effective sample size can be large.
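A sketch of the hard-model protocol on a single synthetic chain. I use an AR(1) trace as a stand-in for MCMC output and a crude lag-1-only effective sample size estimate; real diagnostics (e.g., ArviZ's `ess()`) combine many lags and multiple chains, which is exactly where the estimation error this paragraph worries about comes from:

```python
import numpy as np

rng = np.random.default_rng(3)

# One run of a correlated chain (AR(1) stand-in for an MCMC trace).
rho, N = 0.8, 10_000
chain = np.empty(N)
chain[0] = rng.normal()
for t in range(1, N):
    chain[t] = rho * chain[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

# Crude ESS from the lag-1 autocorrelation alone (AR(1) special case;
# not a substitute for a proper multi-lag ESS estimator).
c = np.corrcoef(chain[:-1], chain[1:])[0, 1]
ess_hat = N * (1 - c) / (1 + c)

# This single-run standard error is what gets reported in place of
# an empirical histogram of errors.
se_hat = chain.std(ddof=1) / np.sqrt(ess_hat)
```

For ρ = 0.8 the true ESS is about N/9 ≈ 1100, and the single-run se_hat is all you get; there's no replication to check it against.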

The problem with both of these approaches is that they treat the reference draws as the truth and only report the variability of the new algorithm. We need to go a step further and fold the error from posteriordb, or whatever reference database we use, into the uncertainty report for the new algorithm's error. Propagating the additional error could be done in simulation (or perhaps analytically; normals are useful that way).
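For independent estimators of the same posterior mean, the propagation is just addition in quadrature: the variance of (new estimate − reference estimate) is the sum of the two variances. A sketch with normals, checked by simulation (the draw counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Both the reference and the new sampler estimate the same posterior
# mean, each from independent draws with sd = 1.
N_ref, N_new = 10_000, 10_000
se_ref = 1.0 / np.sqrt(N_ref)
se_new = 1.0 / np.sqrt(N_new)

# Independent errors add in quadrature.
se_diff = np.sqrt(se_ref**2 + se_new**2)

# Check by simulation: sd of (new estimate - reference estimate).
diffs = np.array([
    rng.normal(size=N_new).mean() - rng.normal(size=N_ref).mean()
    for _ in range(1000)
])
empirical = diffs.std(ddof=1)   # should be close to se_diff
```

The point of the quadrature formula is that once the new algorithm's own standard error drops below se_ref, the reference error dominates the comparison, which is exactly Problem 1 below.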

Reporting a single univariate mean is easy enough. I don't know what the Monte Carlo error of something like a KL-divergence estimate looks like, or how to do this kind of error assessment for metrics like the Wasserstein distance. In the Pathfinder paper, we used a semi-discrete Wasserstein distance algorithm; to be honest, I don't even know how that algorithm works.

**Evaluating delayed rejection HMC**

I ran into this problem because I want to evaluate our delayed rejection HMC sampler in a way that shows the error going to zero as more draws are taken. Specifically, I'd like to extend the plots in the paper showing that standard HMC and NUTS fail, but DR-HMC succeeds, at sampling multiscale distributions like Neal's funnel in its centered parameterization.

There are two basic problems with this.

*Problem 1:* By the time the new algorithm's effective sample size exceeds 10,000, the estimation error is dominated by the standard error of posteriordb's reference draws. And we don't even know the actual error of posteriordb; we only know an estimated standard error, computed under the assumption that the draws are independent.

*Problem 2:* There's no guarantee that the reference draws are unbiased draws from the posterior. For example, in the centered parameterization of Neal's funnel, there is no HMC step size that adequately explores the posterior; each step size picks out a range of the log-scale parameter that it can explore (as we demonstrated in the delayed rejection paper). Standard error estimates from the reference draws assume independent, unbiased draws from the posterior. If the reference database's draws are biased, then as the number of reference draws and new-sampler draws both grow to infinity, the error reported for a perfect sampler will not go to zero.
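Problem 2 is easy to see in a toy simulation. Suppose the reference draws carry a small bias (the value 0.05 is a made-up illustration, not a claim about posteriordb) while the "new" sampler is perfect; the reported error then floors at the bias instead of shrinking like 1/√N:

```python
import numpy as np

rng = np.random.default_rng(5)

bias = 0.05   # hypothetical bias in the reference draws
reported_error = {}
for N in [1_000, 100_000, 1_000_000]:
    ref_mean = rng.normal(size=N).mean() + bias   # biased reference
    new_mean = rng.normal(size=N).mean()          # "perfect" sampler
    reported_error[N] = abs(new_mean - ref_mean)
# As N grows, reported_error converges to |bias|, not to 0.
```

For small N the bias is invisible inside the Monte Carlo noise, which is part of what makes this failure mode hard to detect; only at large N does the floor emerge.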

**Help?**

Does anyone have suggestions for what can be done in the face of potentially biased reference samples for evaluation? I could wait until Thursday's Stan meeting and ask Aki, who's a source of inspiration on topics like this, but I wanted to ask a wider group and see what comes back.