Please see the arXiv paper for details. We denote the R package as dsos, to avoid confusion with D-SOS, the method.
We show how easy it is to implement D-SOS for a particular notion of outlyingness. Suppose we want to test for no adverse shift based on isolation scores in the context of multivariate two-sample comparison. To do so, we need two main ingredients: a score function and a method to compute the \(p\)-value.
First, the scores are obtained using predictions from an isolation forest fitted with the isotree package (Cortes 2020). An isolation forest detects isolated points, that is, instances that are typically out-of-distribution relative to the high-density regions of the data distribution. Any performant method for density-based out-of-distribution detection could be used to achieve the same goal; isolation forest just happens to be a convenient way to do this. The internal function outliers_no_split shows the implementation of one such score function in the dsos package.
dsos:::outliers_no_split
## function (x_train, x_test, num_trees = 500)
## {
## iso_fit <- isotree::isolation.forest(data = x_train, ntrees = num_trees)
## os_train <- predict(iso_fit, newdata = x_train)
## os_test <- predict(iso_fit, newdata = x_test)
## return(list(test = os_test, train = os_train))
## }
## <bytecode: 0x000000001d5ae0f0>
## <environment: namespace:dsos>
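To make the score function concrete, the following minimal sketch applies the same logic to simulated data. The data, the seed, and the helper name score_isolation are assumptions made for illustration and are not part of dsos; the helper simply mirrors the body of outliers_no_split printed above, fitting the forest on the training sample only and scoring both samples with it.

```r
# Illustrative sketch; score_isolation is a hypothetical stand-in that
# mirrors dsos:::outliers_no_split. It is not exported by dsos.
library(isotree)

set.seed(12345)

# Simulated (hypothetical) data: the test sample is shifted away from
# the training sample in both dimensions.
x_train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
x_test <- data.frame(x1 = rnorm(50, mean = 3), x2 = rnorm(50, mean = 3))

score_isolation <- function(x_train, x_test, num_trees = 500) {
  # Fit the isolation forest on the training (reference) sample only.
  iso_fit <- isotree::isolation.forest(data = x_train, ntrees = num_trees)
  # Score both samples with the same fitted forest.
  list(
    test = predict(iso_fit, newdata = x_test),
    train = predict(iso_fit, newdata = x_train)
  )
}

scores <- score_isolation(x_train, x_test)

# Higher scores indicate more isolated (more outlying) points.
summary(scores$train)
summary(scores$test)
```

Because the forest is fitted on the training sample alone, the test scores measure how isolated the test points are relative to the training distribution, which is the notion of outlyingness the test relies on.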
Second, we estimate the empirical null distribution for the \(p-\