`sparsegl`

This package provides tools for fitting regularization paths for sparse group-lasso penalized learning problems. The model is fit for a sequence of the regularization parameter lambda.

The strengths and improvements that this package achieved compared to other sparse group-lasso packages are as follows:

Fortran significantly speeds up the sparse group-lasso model building process.

“Strong rule” is implemented during coordinate descent step, screening out zero-coefficient variables which might not be detected by some conservative criteria.

`X`

can be a sparse input matrix passed as an argument in`sparsegl()`

, and it does not have to be a regular matrix.Plots are made with functions package

`ggplot2`

instead of function`plot`

from`base`

, which provide better appearances and experience for visualization.An extra function

`estimate_risk()`

evaluates the quality of fitted models in terms of different information criteria and provides a means for model selection. It is also an alternative for model selection if cross-validation is too costly.

This package is not on CRAN yet, so it can be installed using the [`devtools`

] (https://cran.r-project.org/package=devtools) package:

`::install_github("dajmcdon/sparsegl", ref = "main") devtools`

Building the vignettes, such as this getting started guide, takes a significant amount of time. They are not included in the package by default. If you want to include vignettes, then use this modified command:

```
::install_github("dajmcdon/sparsegl", ref = "main",
devtoolsbuild_vignettes = TRUE, dependencies = TRUE)
```

For this getting-started vignette, firstly, we will randomly generate `X`

, an input matrix of predictors of dimension `n`

-obs by `p`

-feature. To initiate `y`

, a real-valued vector (make it a vector rather a matrix),

- Linear Regression model: \(y = X\beta^* + \epsilon\).
- Logistic regression model: \(y = (y_1, y_2, \cdots, y_n)\), where \(y_i \sim \text{Bernoulli}\left(\frac{1}{1 + \exp(-X_i \beta^*)}\right)\), \(i = 1, 2, \cdots, n.\)

where the coefficient vector \(\beta^*\) is specified as below, and the white noise \(\epsilon\) following standard normal distribution serves as data variation. Then the sparse group-lasso problem is formulated as the sum of mean squared error ( linear regression) or logistic loss (logistic regression) and a convex combination of lasso penalty and group lasso penalty:

- Linear regression: \[ \min_{\beta\in\mathbb{R}^p}\left(\frac{1}{2n} \rVert y - \sum_g X^{(g)}\beta^{(g)}\rVert_2^2 + (1-\alpha)\lambda\sum_g \sqrt{\text{bs}}\rVert\beta^{(g)}\rVert_2 + \alpha\lambda\rVert\beta\rVert_1 \right) \qquad (*). \]
- Logistic regression: \[ \min_{\beta\in\mathbb{R}^p}\left(\frac{1}{2n}\sum_{i=1}^n \log\left(1 + \exp\left(-y_i\beta^\top X_i\right)\right) + (1-\alpha)\lambda\sum_g \sqrt{\text{bs}}\rVert\beta^{(g)}\rVert_2 + \alpha\lambda\rVert\beta\rVert_1 \right) \qquad (**). \]

where

\(X^{(g)}\) is the submatrix of \(X\) with columns corresponding to the features in group \(g\).

\(\beta^{(g)}\) is the corresponding coefficients of the features in group \(g\).

`bs`

is the length of \(\beta^{(g)}\).\(\alpha\) adjusts the weight between lasso penalty and group-lasso penalty.

\(\lambda\) fine-tunes the size of penalty imposed on the model to control the number of nonzero coefficients of features, and furthermore, to avoid overfitting.

```
library(sparsegl)
set.seed(1010)
<- 100
n <- 200
p <- matrix(data = rnorm(n*p, mean = 0, sd = 1), nrow = n, ncol = p)
X <- c(rep(5, 5), c(5, -5, 2, 0, 0), rep(-5, 5), c(2, -3, 8, 0, 0), rep(0, (p - 20)))
beta_star <- rep(1:(p / 5), each = 5)
groups
# Linear regression model
<- rnorm(n, mean = 0, sd = 1)
eps <- X %*% beta_star + eps
y
# Logistic regression model
<- 1 / (1 + exp(-X %*% beta_star))
pr <- rbinom(n, 1, pr) y_binary
```

`sparsegl()`

Given an input matrix `X`

, and a response vector `y`

(or a matrix with 1 column), a sparse group-lasso regularized linear model is fitted for a sequence of penalty parameter values in terms of penalized maximum likelihood. The penalty is composed of lasso penalty and group lasso penalty. The other main arguments the users might give are:

`group`

: a vector with consecutive integers of length`p`

indicating the grouping of the features. By default, each group only contains one feature if without initialization.`family`

: A character string specifying the likelihood to use, could be either linear regression “gaussian” or logistic regression loss “binomial”. Default is “gaussian”.`pf`

: a penalty vector of the length of the number of groups in total. Default value for each entry is the square-root of the corresponding size of each group, which is \(\sqrt{\text{bs}}\) in \((*)\) and \((**)\) above.`asparse`

: changes the weight of lasso penalty, referring to \(\alpha\) in \((*)\) and \((**)\) above:`asparse`

= \(1\) gives the lasso penalty only.`asparse`

= \(0\) gives the group lasso penalty only. The default value of`asparse`

is \(0.05\).`lower_bnd`

: lower bound for coefficient values, a vector in length of 1 or the number of groups including non-positive numbers only. Default value for each entry is -\(\infty\).`upper_bnd`

: upper bound for coefficient values, a vector in length of 1 or the number of groups including non-negative numbers only. Default value for each entry is \(\infty\).

It returns a `sparsegl`

object, and the main attribute of this object is:

`df`

: the number of nonzero coefficients for each lambda value. This only represents an approximation to the exact degree-of-freedom. Detailed explanation can be checked under function`estimate_risk()`

below.

`<- sparsegl(X, y, group = groups) fit1 `

`plot()`

for `sparsegl`

objectThis function produces nonzero-only coefficient curves for each penalty parameter `lambda`

values in the regularization path for a fitted `sparsegl`

object. The arguments of this function are:

`y_axis`

: can be set with either “coef” or “group”. Default is “coef”.`x_axis`

: can be set with either “lambda” or “penalty”. Default is “lambda”.

To elaborate on `y_axis`

and `x_axis`

:

The plot with

`y_axis`

= “group” is the group norms against the log-`lambda`

or the scaled group norm vector. Each group norm is defined by: \[ \text{asparse}\times\rVert\beta^{(g)}\rVert_1 + (1 - \text{asparse})\times\sum_g\rVert\beta^{(g)}\rVert_2 \] Curves are plotted in the same color if the corresponding features are in the same group. Notice that the number of curves shown on the plots could be less than the actual number of groups since only the groups containing nonzero features at least at one lambda in the sequence`lambda`

are included.The plot with

`y_axis`

= “coef” is the coefficients of features against the log-`lambda`

or the scaled group norm vector. Each curve with a distinct color represents a feature. Again, only the features having nonzero coefficients at least at one lambda value in the sequence`lambda`

are considered.The plot with

`x_axis`

= “lambda” indicates the`x_axis`

presenting with log-`lambda`

.The plot with

`x_axis`

= “penalty” indicates the`x_axis`

presenting with the scaled group norm vector. Each element in this vector is defined by: \[ \frac{\text{asparse}\times\rVert \beta\rVert_1 + (1-\text{asparse})\times\sum_g\rVert \beta^{(g)}\rVert_2}{\max_\beta\left(\text{asparse}\times\ \rVert \beta\rVert_1 + (1-\text{asparse})\times\sum_g\rVert \beta^{(g)}\rVert_2\right)} \]

`plot(fit1, y_axis = "group", x_axis = "lambda")`

`plot(fit1, y_axis = "coef", x_axis = "penalty", add_legend = FALSE)`

`coef()`

, `predict()`

and `print()`

`sparsegl`

objectAll three functions consume a fitted `sparsegl`

object as an arguments and

`coef()`

and`predict()`

return a matrix of coefficients and predictions \(\hat{y}\) given a matrix`X`

at each lambda respectively, unless an optional argument`s`

is assigned with a specified vector or numeric value from (but not necessarily) the original`lambda`

sequence.`print()`

returns the number of nonzero features with nonzero coefficients at each`lambda`

.

```
<- coef(fit1, s = c(0.02, 0.03))
coef <- predict(fit1, newx = X[100,], s = fit1$lambda[2:3])
pred print(fit1)
```

`cv.sparsegl()`

This function does a k-fold cross-validation (cv) on `sparsegl`

. It takes the same arguments `X`

, `y`

, `group`

, which are specified above, with additional argument `pred.loss`

. It can be set with either “L2” or “L1” for linear regression model, “loss” or “misclass” for logistic regression model indicating the loss to use for cv error. This will return a `cv.sparseg`

object.

`<- cv.sparsegl(X, y, group = groups, pred.loss = "L1") fit_l1 `

`plot()`

, `coef()`

and `predict()`

for `cv.sparsegl`

object`plot()`

: produces cross-validation curve with upper and lower confidence bounds plots for each`lambda`

in the regularization path for a fitted`cv.sparsegl`

object. In the generated plot, the red points represent the`cvm`

values at each`lambda`

, and the attached gray vertical error bars indicate the uncertainty of`cvm`

.

`plot(fit_l1)`

`coef()`

and`predict()`

for`cv.sparsegl`

object work similarly to those introduced above. The only differences are that the argument`object`

passes a fitted`cv.sparsegl`

object, and`s`

can also be set with`lambda.1se`

,`lambda.min`

in addition.

```
<- coef(fit_l1, s = "lambda.1se")
coef <- predict(fit_l1, newx = X[50:80, ], s = "lambda.min") pred
```

`estimate_risk()`

This function returns the information criterion, which is the sum of the maximum log-likelihood and a penalty term determined by the chosen penalty type for a `sparsegl`

model at each `lambda`

. It provides a means for model selection by representing the trade-off between the goodness of fit of the model and the complexity of the model. It takes the same arguments `X`

and `y`

from the function `sparsegl()`

. The additional arguments it needs are:

`object`

: a fitted`sparsegl`

object.`type`

: three types of penalty used for calculation:AIC (Akaike information criterion): 2 * df / n

BIC (Bayesian information criterion): log(n) * df / n

GCV (Generalized cross validation): -2 * log(1 - df / n)

where df is the degree-of-freedom, and n is the sample size.

`approx_df`

: indicate if a vector of the approximation to the degree-of-freedom at each penalty parameter`lambda`

is used. Default is FALSE and the program will compute the unbiased estimate of the exact degree-of-freedom.

FYI: Degree-of-freedom is a tool to assess the complexity of a statistical modeling procedure. `object$df`

, the approximation to the degree-of-freedom is the number of nonzero coefficients of the model. Notice that it would take some time to calculate the unbiased estimate of the exact degreeo-of-freedom if `X`

is complicated. For more details about how to realize this calculation, our method is implemented based on the paper https://arxiv.org/pdf/1212.6478.pdf.

`<- estimate_risk(fit1, X, y, type = "AIC", approx_df = FALSE) risk `