```
library(CommKern)
##
## Attaching package: 'CommKern'
## The following object is masked from 'package:stats':
##
## kernel
```

In the past decade, network analysis has become a popular method for
analyzing data from a variety of fields, including genomics,
metabolomics, ecology, social networks, and neuroimaging. Brain networks
are a complex system of structural-functional couplings within a
hierarchy of mesoscopic interactions. Because of the unique aspects of
brain connectivity, many of the popular graph theoretic measures fail to
capture its complex dynamics. The **CommKern** package was
created to address these limitations in the analysis of network community
structure through two main contributions. The first is a hierarchical,
multimodal spinglass algorithm that is computationally efficient and
able to robustly identify communities using multiple sources of
information. The flexibility of the multimodal and hierarchical aspects
of this algorithm sets it apart from other existing methods, which may
only allow for one source of information or lack a hierarchical
option. The second is a semiparametric kernel test of association, which
allows for the use of a kernel to model a variety of cluster evaluation
metrics without making any assumption as to the parametric form of their
association with the outcome. As well, because the model is
semiparametric, potential confounders can be controlled for within a
parametric framework, allowing for ease of parameter estimate
interpretation, should it be desired.

This vignette is designed to provide an overview of
**CommKern**’s functionality using the simulated datasets
provided within the package. While focusing on applications to
neuroimaging, the methodologies within this package are flexible enough
to be applied to any type of network data, which will be pointed out in
the vignette when applicable.

Several simulated datasets have been provided within the package.

`SBM_net`

This is a dataset containing multimodal network information simulated
to emulate functional and structural connectivity within a nested
hierarchical structure.

`matrix_plot(SBM_net)`

The `SBM_net` object is the result of calling `matrix_to_df`.

```
net <- matrix_to_df(func_mat = SBM_net$func_mat, str_mat = SBM_net$str_mat)
identical(net, SBM_net)
## [1] TRUE
```

Network data used within the package’s community detection algorithm
takes a specific form. Using functional and structural adjacency
matrices as inputs, the `matrix_to_df` function creates a
list of five dataframes:

- `func_edges`, which describes the pairwise functional edge weights between nodes.
- `str_edges`, which describes the pairwise structural edge weights between nodes.
- `vertexes`, which describes nodal characteristics, including functional degree, structural degree, node label (which can contain additional information about the node), and community assignment (for later use).
- `func_matrix`, which is the functional adjacency matrix.
- `str_matrix`, which is the structural adjacency matrix.

```
str(SBM_net)
## List of 5
## $ func_edges :'data.frame': 1233 obs. of 3 variables:
## ..$ func_start_node: int [1:1233] 2 3 4 5 6 7 8 9 10 11 ...
## ..$ func_end_node : int [1:1233] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ func_weight : num [1:1233] 0.715 0.734 0.739 0.719 0.769 ...
## $ str_edges :'data.frame': 453 obs. of 3 variables:
## ..$ str_start_node: int [1:453] 2 3 5 6 7 8 9 10 13 15 ...
## ..$ str_end_node : int [1:453] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ str_weight : num [1:453] 0.611 0.598 0.601 0.617 0.594 ...
## $ vertexes :'data.frame': 80 obs. of 5 variables:
## ..$ node_id : int [1:80] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ node_label : logi [1:80] NA NA NA NA NA NA ...
## ..$ func_degree: num [1:80] 11.2 11.3 11 11.5 11.1 ...
## ..$ str_degree : num [1:80] 5.81 5.65 6.27 4.03 3.37 ...
## ..$ community : logi [1:80] NA NA NA NA NA NA ...
## $ func_matrix: num [1:80, 1:80] 0 0.715 0.734 0.739 0.719 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:80] "1" "2" "3" "4" ...
## .. ..$ : chr [1:80] "1" "2" "3" "4" ...
## $ str_matrix : num [1:80, 1:80] 0 0.611 0.598 0 0.601 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:80] "1" "2" "3" "4" ...
## .. ..$ : chr [1:80] "1" "2" "3" "4" ...
## - attr(*, "class")= chr "spinglass_net"
```

`simasd_covars`

This dataframe is a simulated dataset of demographics based on summary statistics for a random subset of the ABIDE pre-processed database.

```
str(simasd_covars)
## 'data.frame': 49 obs. of 8 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dx_group : num 1 1 1 1 0 0 0 0 0 1 ...
## $ sex : int 0 0 1 0 1 1 0 0 0 1 ...
## $ age : num 25.2 18.9 16.1 27.6 37.6 20.6 7.7 21.6 19 25.5 ...
## $ handedness : num 2 0 0 0 1 0 1 0 0 0 ...
## $ fullscale_IQ: num 107 123 100 109 114 130 118 120 101 106 ...
## $ verbal_IQ : num 105 125 96 107 118 136 122 124 103 103 ...
## $ nonverbal_IQ: num 112 120 109 113 113 124 116 117 104 112 ...
```

`simasd_comm_df`

This is a dataset of partitions of nodes to communities from
simulated group-level networks with community structures. This dataset
is complementary to the `simasd_covars` dataset and is
derived from the `simasd_array` dataset.

```
str(simasd_comm_df, max.level = 0)
## 'data.frame': 80 obs. of 49 variables:
```

`simasd_hamil_df`

This is a dataset of Hamiltonian values from simulated group-level
networks with community structure. This dataset is complementary to the
`simasd_covars` dataset and is derived from the
`simasd_array` dataset.

```
str(simasd_hamil_df)
## 'data.frame': 49 obs. of 2 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ hamil: num -342 -347 -327 -361 -363 ...
```

`simasd_array`

This is a dataset containing an array of simulated adjacency
matrices. The dimensions of each matrix are 80 x 80, for a total of 49
simulated networks. This simulated array is the basis of the
`simasd_hamil_df` and `simasd_comm_df` datasets
and is complementary to the `simasd_covars` dataframe.

```
str(simasd_array)
## num [1:49, 1:80, 1:80] 0 0 0 0 0 0 0 0 0 0 ...
```

The hierarchical multimodal spinglass (HMS) algorithm was created as a way to address the unique characteristics of brain connectivity, specifically focusing on the hierarchical aspect to brain network segregation and integration and the structure-function relationship.

First introduced by Reichardt and Bornholdt in 2006, the spinglass algorithm relies on the analogy between the Potts spinglass model - a popular statistical mechanics model - and community detection, where any community partitioning is likened to a corresponding spin configuration and the problem of community detection reduces to finding the ground state of the Potts spinglass model. Building on the idea of a quality function which, when optimized, represents the ideal partitioning of a network, they determined that a quality function should embody four components: it should (a) reward internal edges between nodes of the same group/spin state; (b) penalize missing edges (non-links) between nodes in the same group; (c) penalize existing edges between different groups/spin states; and (d) reward non-links between different groups. Using these requirements, they built the following function: \[\begin{equation} \begin{split} \mathcal{H}\left(\left\{\mathbf{C}\right\}\right) = & -\sum_{i \neq j}a_{ij}\underbrace{W_{ij} \delta \left(C_i,C_j\right)}_{\text{internal links}} + \sum_{i \neq j}b_{ij}\underbrace{\left(1-W_{ij}\right) \delta \left(C_i,C_j\right)}_{\text{internal non-links}} \\ & + \sum_{i \neq j}c_{ij}\underbrace{W_{ij} \left(1-\delta \left(C_i,C_j\right)\right)}_{\text{external links}} - \sum_{i \neq j}d_{ij}\underbrace{\left(1-W_{ij}\right) \left(1-\delta \left(C_i,C_j\right)\right)}_{\text{external non-links}}, \end{split} \end{equation}\] where \(W_{ij}\) denotes the weighted adjacency matrix with \(W_{ij} > 0\) if an edge is present and zero otherwise, \(C_{i} \in \left\{1,2,...,q\right\}\) denotes the group index (or spin state) of node \(i\) in the graph, \(\delta\) is the Kronecker delta, equaling one if \(C_i\) and \(C_j\) belong to the same community and zero otherwise, and \(a_{ij}, b_{ij}, c_{ij}, d_{ij}\) denote the weights of the individual contributions, respectively. 
One of the appeals of the spinglass algorithm is that the number of groups \(q\) only determines the maximum number of groups allowed to exist and, in principle, can be as large as the total number of nodes in the network, \(N\). All group indices specified by \(q\) have to be used during the optimal assignment of nodes to communities, though some may be unpopulated in the ground state.

Generally, links and non-links are weighted equally, \(a_{ij}=c_{ij}\) and \(b_{ij}=d_{ij}\), which simplifies the algorithm to only need to consider internal links and non-links. A convenient choice for \(a_{ij}\) and \(b_{ij}\) is \(a_{ij} = 1-\gamma p_{ij}\) and \(b_{ij} = \gamma p_{ij}\), where \(p_{ij}\) is the probability that a non-zero edge exists between nodes \(i\) and \(j\), normalized such that \(\sum_{i \neq j}p_{ij} = 2M\), where \(M\) is the total weight (or sum of all edge weights) of the graph. Adopting all of these settings, we can simplify the above equation to

\[\begin{equation} \begin{split} \mathcal{H}\left(\left\{\mathbf{C}\right\}\right) = & -\sum_{i \neq j}\left(W_{ij} - \gamma p_{ij}\right) \delta \left(C_{i},C_{j}\right). \end{split} \end{equation}\]
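To make the simplified quality function concrete, here is a small base-R sketch (an illustration only; `toy_hamiltonian` is a name invented for this vignette, not a CommKern function) that evaluates \(\mathcal{H}\) under the configuration null model \(p_{ij}=k_ik_j/(2M)\):

```r
# Toy evaluation of H(C) = -sum_{i != j} (W_ij - gamma * p_ij) * delta(C_i, C_j)
toy_hamiltonian <- function(W, comms, gamma = 1) {
  k <- rowSums(W)               # weighted degrees
  M <- sum(W) / 2               # total edge weight of the graph
  P <- outer(k, k) / (2 * M)    # configuration null model p_ij
  delta <- outer(comms, comms, "==")
  diag(delta) <- FALSE          # exclude the i = j terms
  -sum((W - gamma * P) * delta)
}

# Two clear communities: nodes {1,2} and nodes {3,4}
W <- matrix(c(0, 1, 0, 0,
              1, 0, 0, 0,
              0, 0, 0, 1,
              0, 0, 1, 0), nrow = 4)
toy_hamiltonian(W, c(1, 1, 2, 2))  # -3: the natural split is rewarded
toy_hamiltonian(W, c(1, 2, 1, 2))  #  1: a mismatched split is penalized
```

Lower (more negative) values of the Hamiltonian indicate a better partition, which is what the optimization described below searches for.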

Expanding on the original algorithm of Reichardt and Bornholdt, in 2012, Eaton and Mansbach introduced the concept of semi-supervised community detection to the spinglass model. When knowledge can be used to inform the community search, either through knowledge of a specific community of interest or partial knowledge of community memberships, its incorporation can be used to help compensate for potential noise in the network. As the Hamiltonian captures the total energy of the Potts model, external knowledge can be incorporated into community detection by penalizing for community structures that violate guidance. Generally, let the disagreement of the communities to a given guidance be specified by a function \(U: C \rightarrow \mathbb{R}\), where \(U(C)\) is smaller when the communities \(C\) agree with the guidance and larger when they disagree, with the following form:

\[\begin{equation} U(C) = \sum_{i \neq j} \left( u_{ij} \left( 1-\delta\left(C_{i},C_{j}\right) \right) \pm \bar{u}_{ij} \delta \left(C_{i},C_{j}\right)\right), \end{equation}\] where \(u_{ij}\) is the penalty for violating guidance that \(v_{i}\) and \(v_{j}\) belong to the same community and \(\bar{u}_{ij}\) is the penalty for violating guidance that \(v_{i}\) and \(v_{j}\) belong to different communities. When incorporating the guidance into the Hamiltonian,

\[\begin{equation} \mathcal{H}'\left( C \right) = \mathcal{H}\left( C \right) + \mu \sum_{i \neq j} \left( u_{ij} - \left(u_{ij}-\bar{u}_{ij}\right) \delta \left(C_{i},C_{j}\right)\right), \end{equation}\] where \(\mu \geq 0\) controls the balance between the inherent community structure and the external guidance. Rewriting \(\mathcal{H}'\) to be more conducive to optimization, we can decompose it as

\[\begin{equation} \mathcal{H}'\left( C \right) = -\sum_{i \neq j}M_{ij}\delta\left(C_{i},C_{j}\right) - \mu \sum_{i \neq j}\Delta U_{ij} \delta\left(C_{i},C_{j}\right)+K, \end{equation}\] where \(M_{ij}=A_{ij}-\gamma P_{ij}\), \(\Delta U_{ij}=u_{ij} - \bar{u}_{ij}\), and all constant terms are contained within \(K = \mu \sum_{i \neq j}u_{ij}\), which can ultimately be dropped from the optimization problem.

Adapting Eaton and Mansbach’s method for multimodal community detection in the human brain, our modified Hamiltonian takes the following form,

\[\begin{equation} \mathcal{H}'\left( C \right) = - \sum_{i \neq j} M_{ij}\delta\left(C_{i},C_{j}\right) - \alpha \sum_{i \neq j} S_{ij} \delta\left(C_{i},C_{j}\right), \end{equation}\] where \(M_{ij}\) is the modularity matrix associated with functional connectivity, \(S_{ij}\) is the structural connectivity matrix, and \(\alpha \geq 0\) controls the balance between the community structure determined solely by functional connectivity and being informed by the structural connectivity via white matter fiber tract information. In contrast to Eaton and Mansbach’s approach, where penalties are applied when violating guidance for nodes belonging both to the same community or different communities, our method only decreases the total energy of the model when white matter fiber tracts exist between regions \(v_i\) and \(v_j\); otherwise, the guidance term is set to zero, with the algorithm deciding the community assignments for \(v_i\) and \(v_j\) based solely on their functional connectivity.

To then study the hierarchical community structure of these networks, we implement an iterative method. Specifically, we create a multilayer network, where communities from layer \(k\) each become subnetworks in layer \(k+1\) wherein the multimodal spinglass algorithm is applied. The number of iterations (or layers) is pre-specified in the algorithm to avoid the need for stopping criteria. The number of communities \(q\) is held constant across each layer of the network. An additional parameter, \(\tau\), is also specified a priori, controlling the acceptance rate of the algorithm.

Before implementing the algorithm within the
**CommKern** package, we start from a common brain atlas
(e.g., Automated Anatomical Labeling (AAL), Harvard-Oxford (HO),
Craddock 200 (CC200), etc.), where adjacency matrices representing
functional and structural connectivity are derived from the correlation
between resting state functional MRI (rs-fMRI) blood oxygen level
dependent (BOLD) time series or the number of white matter fiber tracts
computed from diffusion tensor MRI, respectively.

Once we have structural and functional adjacency matrices, we must
create the network dataframe used within the `hms` algorithm
function. This is done using the `matrix_to_df` function
(note: the `SBM_net` dataset is already in this form for the
end user). It should be noted that both a functional and structural
matrix need to be specified for the `matrix_to_df` function
to work. If there is no structural (or guidance) matrix for use, the end
user can simply create a matrix of \(0\)s of the same dimensions for input into
the function. This will not affect the end results of the
`hms` function.
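For example, a functional-only network object could be built like this (a sketch using the package's `SBM_net` data; `zero_mat` and `net_func_only` are names invented here):

```r
# No structural/guidance matrix available: supply an all-zero matrix
# with the same dimensions as the functional adjacency matrix
zero_mat <- matrix(0,
                   nrow = nrow(SBM_net$func_mat),
                   ncol = ncol(SBM_net$func_mat))
net_func_only <- matrix_to_df(func_mat = SBM_net$func_mat,
                              str_mat  = zero_mat)
```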

```
net <- matrix_to_df(func_mat = SBM_net$func_mat, str_mat = SBM_net$str_mat)
identical(net, SBM_net)
## [1] TRUE
```

From here, we can implement the `hms` function. There are
several input parameters to discuss.

```
str(hms)
## function (input_net, spins, alpha, coolfact, tol, max_layers)
```

The `spins` parameter denotes the *maximum* number
of spins, or communities, the algorithm can utilize during optimization.
The integer value of `spins` must lie within
\(\left[2,\texttt{ # of network
nodes}\right]\). However, the higher the number of spins, the
more complex the optimization process, leading to longer computation
times. If the number of communities is unknown, we recommend an
exploratory analysis with less restrictive parameters to determine the
number of communities the algorithm tends to settle on.

The `alpha` parameter is a numeric parameter \(\left(\geq 0\right)\) that balances the use
of the guidance matrix in the algorithm’s optimization. When \(\alpha=0\), the algorithm ignores the
guidance matrix completely, and when \(\alpha=1\), the algorithm weights the
functional and structural inputs equally.

The `coolfact` parameter indicates how quickly (or slowly)
to cool the heatbath algorithm, which is used to find the optimal
partitioning of nodes to communities. This parameter must be in \(\left(0,1\right)\), but is usually set to
be \(0.95-0.99\).

The `tol` parameter indicates the tolerance level for
accepting the proposed changes within a temperature; at the end of each
sweep, the number of proposed changes to the partition is assessed to
see if it exceeds a threshold determined as a function of `tol` and
`spins`. This parameter is typically set to be 0.01-0.05.

Finally, `max_layers` is the parameter controlling the
number of layers of communities within the network. At least one layer
must be specified in the algorithm. End users can specify the number of
layers to be any positive whole number, but several considerations should
be made. First, as the number of layers increases, so does the
computational time. Second, over-specification of the community layers
can create singleton issues (a single node in a community), leading to
the algorithm throwing an error. Finally, the number of layers specified
should be data driven; therefore, we cannot provide a universally
optimal number of layers to specify.

Using the `SBM_net` dataset as an example, below we have
run the `hms` algorithm to find the first layer of
bipartitioning, suppressing the guidance matrix in the optimization.

The output from the `hms` function call is a list containing
two main components (along with `net`, a copy of the input
network object):

- `comm_layers_tree`, a dataframe whose first column is the node ID (pulled from the input network object) and all other columns denote the partitioning of nodes to communities across layers.
- `best_hamiltonian`, a vector of the optimized Hamiltonian values for each run of the HMS algorithm.

```
hms_object <-
  hms(
    input_net = SBM_net,
    spins = 2,
    alpha = 0,
    coolfact = 0.99,
    tol = 0.01,
    max_layers = 1)
str(hms_object)
## List of 3
## $ comm_layers_tree:'data.frame': 80 obs. of 2 variables:
## ..$ node_id: int [1:80] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ layer_1: int [1:80] 2 2 2 2 2 2 2 2 2 2 ...
## $ net :List of 5
## ..$ func_edges :'data.frame': 1233 obs. of 3 variables:
## .. ..$ func_start_node: int [1:1233] 2 3 4 5 6 7 8 9 10 11 ...
## .. ..$ func_end_node : int [1:1233] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ func_weight : num [1:1233] 0.715 0.734 0.739 0.719 0.769 ...
## ..$ str_edges :'data.frame': 453 obs. of 3 variables:
## .. ..$ str_start_node: int [1:453] 2 3 5 6 7 8 9 10 13 15 ...
## .. ..$ str_end_node : int [1:453] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ str_weight : num [1:453] 0.611 0.598 0.601 0.617 0.594 ...
## ..$ vertexes :'data.frame': 80 obs. of 5 variables:
## .. ..$ node_id : int [1:80] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ node_label : logi [1:80] NA NA NA NA NA NA ...
## .. ..$ func_degree: num [1:80] 11.2 11.3 11 11.5 11.1 ...
## .. ..$ str_degree : num [1:80] 5.81 5.65 6.27 4.03 3.37 ...
## .. ..$ community : logi [1:80] NA NA NA NA NA NA ...
## ..$ func_matrix: num [1:80, 1:80] 0 0.715 0.734 0.739 0.719 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
## ..$ str_matrix : num [1:80, 1:80] 0 0.611 0.598 0 0.601 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
## ..- attr(*, "class")= chr "spinglass_net"
## $ best_hamiltonian: num -286
## - attr(*, "class")= chr "spinglass_hms"
```

The HMS algorithm is flexible enough that either the multimodal or hierarchical aspect can be suppressed. To suppress the multimodal aspect, set \(\alpha=0\); to suppress the hierarchical aspect, set `max_layers = 1`. If both are suppressed, the algorithm simplifies to the spinglass algorithm originally proposed by Reichardt and Bornholdt.

Once the algorithm has finished running, you can visualize the
results in a plot using `community_plot`. Using the output
from the `hms` run on the `SBM_net` dataset, we
can produce the following plot:

`community_plot(hms_object)`

If the HMS algorithm is used on network data for which ground truth
is unknown (like in the case of brain connectivity), the
**CommKern** package has provided several functions to
understand the robustness and stability of the partitioning of nodes to
communities across multiple runs.

First is the `community_allegiance` function. For node
\(i\), the stability of its allegiance to community \(A\) is calculated
as the number of runs in which node \(i\) belongs to community \(A\),
divided by the total number of runs. This measure is bounded in
\(\left[0,1\right]\), where higher values of stability indicate that a
node belongs to a single community across a greater number of runs; the
function can only work on a single layer of the hierarchy at a time.
Higher proportions of nodes with high levels of community allegiance
indicate a robust community detection algorithm.
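For a single node, the calculation reduces to a simple proportion; a minimal by-hand sketch (the labels here are made up for illustration):

```r
# Community labels that node i received across four runs of the algorithm
runs_for_node_i <- c(1, 1, 2, 1)
mean(runs_for_node_i == 1)  # allegiance of node i to community 1: 0.75
```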

Below is a simple example of how community allegiance is calculated and visualized.

```
set.seed(7183)
x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)
z <- sample(x = rep(1:3, 4), 12)

xyz_comms <- data.frame(id = seq(1:length(x)), x_comm = x, y_comm = y, z_comm = z)
xyz_alleg <- community_allegiance(xyz_comms)
xyz_melt  <- reshape2::melt(xyz_alleg)

print(xyz_comms)
##    id x_comm y_comm z_comm
## 1   1      3      3      2
## 2   2      3      3      2
## 3   3      2      2      3
## 4   4      3      2      2
## 5   5      1      2      3
## 6   6      2      1      1
## 7   7      3      1      3
## 8   8      2      3      3
## 9   9      1      3      1
## 10 10      2      1      2
## 11 11      1      2      1
## 12 12      1      1      1

ggplot2::ggplot(data = xyz_melt) +
  ggplot2::theme_minimal() +
  ggplot2::aes(x = as.factor(Var1), y = as.factor(Var2), fill = value) +
  ggplot2::geom_tile() +
  ggplot2::xlab('Node') + ggplot2::ylab('Node') +
  ggplot2::ggtitle('Community Allegiance Example') +
  ggplot2::scale_fill_gradient2(
    low = 'navy',
    high = 'goldenrod1',
    mid = 'darkturquoise',
    midpoint = 0.5,
    limit = c(0, 1),
    space = 'Lab',
    name = '')
```

The average community allegiance value within this simple example is 0.3333333, meaning that, on average, two nodes appear in the same community 33% of the time.

If multiple runs of the HMS algorithm have been performed on the same
dataset and a single, representative partition is desired, end users can
utilize the `consensus_similarity` function. This function
can be used to identify a single, representative partition from a set of
partitions that is most similar to all others. Similarity is taken to be
the z-score of the adjusted Rand coefficient. This definition of
consensus similarity was first defined by Doron et al. in a 2012
*PNAS* publication. Pairwise z-scores of the adjusted Rand
coefficient are calculated for all partitions, then the average pairwise
similarity is calculated for each partition. The partition with the
highest average pairwise similarity is then chosen to be the consensus
partition.

Below is a simple example of how the `consensus_similarity`
function works.

```
set.seed(7183)
x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)
z <- sample(x = rep(1:3, 4), 12)

xyz_comms_mat <- matrix(c(x, y, z), nrow = length(x), ncol = 3)
consensus_similarity(xyz_comms_mat)
## [1] 3 3 2 3 1 2 3 2 1 2 1 1
```
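The selection logic can be sketched in base R with a generic similarity function (here plain label agreement stands in for the z-scored adjusted Rand coefficient, so unlike the package function it is not invariant to label permutations; `sketch_consensus` is a name invented here):

```r
# Pick the partition (column) with the highest average similarity to the rest
sketch_consensus <- function(parts, sim = function(a, b) mean(a == b)) {
  n <- ncol(parts)
  avg_sim <- sapply(seq_len(n), function(i) {
    mean(sapply(setdiff(seq_len(n), i), function(j) sim(parts[, i], parts[, j])))
  })
  parts[, which.max(avg_sim)]  # the most representative partition
}

parts <- cbind(c(1, 1, 2, 2), c(1, 2, 1, 2), c(1, 2, 1, 2))
sketch_consensus(parts)  # picks the partition shared by columns 2 and 3
```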

End users can simulate group-level networks with community structure
using two main functions within the **CommKern** package:
`group_network_perturb` and `group_adj_perturb`.

```
str(group_network_perturb)
## function (n_nodes, n_comm, n_nets, perturb_prop, wcr, bcr, bfcr = NA, fuzzy_comms = NA)
str(group_adj_perturb)
## function (group_network_list, n_nets, n_nodes)
```

First, `group_network_perturb` creates a list of simulated
networks. Each network is in a data.frame format, which
describes the community assignment for each node in the network and
simulates the edge weights based on whether the node dyad is (a) within
the same community; (b) between different communities; or (c) between
different communities but designated as “fuzzy” in their distinction
from one another. This function takes several inputs:

- `n_nodes`, the number of nodes in each network.
- `n_comm`, the number of communities to simulate in each network.
- `n_nets`, the number of networks to simulate.
- `perturb_prop`, the proportion of network nodes to randomly alter their community assignment within each network; this helps to simulate individual variability.
- `wcr`, the within-community edge weights, sampled from a Beta distribution.
- `bcr`, the between-community edge weights, sampled from a Beta distribution.
- `bfcr`, the fuzzy community edge weights, sampled from a Beta distribution.
- `fuzzy_comms`, the communities whose distinction is “fuzzy,” or not as distinct; fuzzy communities tend to have higher between-community edge weights than non-fuzzy communities.

Once the `group_network_perturb` function is run, its
output is used as one of the inputs for the
`group_adj_perturb` function, which produces an array of
adjacency matrices. These adjacency matrices can then be used as input
to any community detection algorithm (such as the HMS algorithm).

Below is an example of how to simulate a group of networks with similar community structure, with a visualization of one resulting simulated adjacency matrix.

```
sim_nofuzzy <-
  group_network_perturb(
    n_nodes = 50,
    n_comm = 4,
    n_nets = 3,
    perturb_prop = 0.1,
    wcr = c(8, 8),
    bcr = c(1.5, 8)
  )

nofuzzy_adj <-
  group_adj_perturb(sim_nofuzzy, n_nets = 3, n_nodes = 50)

if (require(pheatmap)) {
  pheatmap::pheatmap(
    nofuzzy_adj[1, , ],
    treeheight_row = FALSE,
    treeheight_col = FALSE
  )
}
## Loading required package: pheatmap
```

As well, the following is an example of group-level networks simulated with fuzzy network structure.

```
sim_fuzzy <-
  group_network_perturb(
    n_nodes = 50,
    n_comm = 4,
    n_nets = 3,
    perturb_prop = 0.1,
    wcr = c(8, 8),
    bcr = c(1.5, 8),
    bfcr = c(3.5, 8),
    fuzzy_comms = c('comm_b', 'comm_c')
  )

fuzzy_adj <-
  group_adj_perturb(sim_fuzzy, n_nets = 3, n_nodes = 50)

if (require(pheatmap)) {
  pheatmap::pheatmap(
    fuzzy_adj[1, , ],
    treeheight_row = FALSE,
    treeheight_col = FALSE
  )
}
```

Within community detection, the methods for evaluating the performance of a particular algorithm can be classified as either (1) extrinsic, requiring ground truth labels, or (2) intrinsic, not requiring ground truth labels.

All external clustering measures rely on an \(r \times k\) contingency matrix \(\mathbf{N}\) that is induced by a system-clustering \(\mathcal{C}\) and a ground truth partitioning \(\mathcal{T}\), where the rows represent the system-clusters \(\mathcal{C}_i \in \mathcal{C}\) for \(i=1,\dots,r\) and the columns represent the ground truth partitions \(\mathcal{T}_j \in \mathcal{T}\) for \(j=1,\dots,k\). A cell at the intersection of row \(i\) and column \(j\) contains the count \(n_{ij}\) of points that are common to cluster \(\mathcal{C}_i\) and ground truth partition \(\mathcal{T}_j\):

\[\begin{equation} N\left(i,j\right)=n_{ij}=\vert\mathcal{C}_i \cap \mathcal{T}_j \vert \end{equation}\] However, in many cases a ground truth labeling does not exist, such as with brain connectivity. In this case, a two-way matching can be performed, where the matching is performed twice, each time with one of the clusterings treated as the ground truth, and the results are combined using a harmonic mean to produce a final matching measure.
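As a quick numeric illustration of the two-way combination (the scores 0.8 and 0.5 are arbitrary, not taken from any dataset):

```r
# Harmonic mean of the two one-directional matching scores
harmonic_mean <- function(a, b) 2 * a * b / (a + b)
harmonic_mean(0.8, 0.5)  # ~0.615, pulled toward the weaker direction
```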

There are several options for external cluster evaluation metrics. For all, assume \(\Omega=\left\{\omega_1,\omega_2,\dots,\omega_k\right\}\) and \(\mathbb{C}=\left\{c_1,c_2,\dots,c_j\right\}\) are two sets of clusters.

*Purity*: quantifies the extent to which a cluster \(c_i\) contains elements from only one partition, or measures how “pure” a cluster is. Purity is bounded in \(\left[0,1\right]\) but is not symmetric due to the need to have a ground truth labeling. To get around this issue, the harmonic mean of the purity between two clusterings is taken, where one partition is taken to be the ground truth labeling, then the other.
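A by-hand sketch of the one-directional purity computation (`purity_oneway` is a name invented here; the package's `purity` function additionally symmetrizes via the harmonic mean of the two directions):

```r
# One-directional purity: for each cluster, count the points falling in its
# best-matching ground truth partition, then divide by the total point count
purity_oneway <- function(clusters, truth) {
  tab <- table(clusters, truth)
  sum(apply(tab, 1, max)) / length(clusters)
}

purity_oneway(clusters = c(1, 1, 2, 2), truth = c(1, 1, 1, 2))  # 0.75
```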

```
set.seed(7183)
<- sample(x = rep(1:3, 4), 12)
x <- sample(x = rep(1:3, 4), 12)
y
purity(x,y)
## [1] 0.5
```

*Normalized Mutual Information (NMI)*: based on the idea of mutual information, which is defined between two random variables as a measure of the mutual dependence between them. It quantifies the “amount of information” obtained about one random variable by observing the other. It is linked to the concept of entropy, which attempts to characterize the “unpredictability” of a random variable. Normalized variants of mutual information allow for comparisons between different clusterings that have different numbers of clusters (but not a different number of nodes, meaning the underlying networks must be of the same size).

```
set.seed(7183)
<- sample(x = rep(1:3, 4), 12)
x <- sample(x = rep(1:3, 4), 12)
y
NMI(x,y)
## [1] 0.05360537
```

*Adjusted Rand Index (ARI)*: An alternative to an information theoretic interpretation of clustering evaluation (e.g., NMI, purity) is to view clusterings as a series of decisions, one for each of the \(N\left(N-1\right)/2\) pairs of nodes in the network. The adjusted Rand Index (ARI) is the corrected-for-chance version of the Rand Index, which establishes a baseline by using the expected similarity of all pairwise comparisons between clusterings specified by a random model. ARI can yield negative values if the index is less than the expected index.

```
set.seed(7183)
<- sample(x = rep(1:3, 4), 12)
x <- sample(x = rep(1:3, 4), 12)
y
adj_RI(x,y)
## [1] -0.1458333
```

To create a matrix of distances, the function
`ext_distance` can be used. This function uses the community
output values from any community detection algorithm, such as the HMS
algorithm, as an input. Because extrinsic cluster evaluation metrics use
the underlying idea of similarity, distance is calculated as \(\left(1-\text{similarity}\right)\). The use
of distance ensures that the distance matrix will be positive
semi-definite, a requirement for the semiparametric kernel methods
detailed in the next section.

```
x <- c(2,2,3,1,3,1,3,3,2,2,1,1)
y <- c(3,3,2,1,1,1,1,2,2,3,2,3)
z <- c(1,1,2,3,2,3,2,1,1,2,3,3)

xyz_comms <- data.frame(x_comm = x, y_comm = y, z_comm = z)
ext_distance(xyz_comms, variant = 'NMI')
## 1 2 3
## [1,] 0.0000000 0.6963946 0.3412397
## [2,] 0.6963946 0.0000000 0.8412397
## [3,] 0.3412397 0.8412397 0.0000000
ext_distance(xyz_comms, variant = 'adj_RI')
## 1 2 3
## [1,] 0.0000000 0.9166667 0.4583333
## [2,] 0.9166667 0.0000000 1.0694444
## [3,] 0.4583333 1.0694444 0.0000000
ext_distance(xyz_comms, variant = 'purity')
## 1 2 3
## [1,] 0.0000000 0.4166667 0.1666667
## [2,] 0.4166667 0.0000000 0.5000000
## [3,] 0.1666667 0.5000000 0.0000000
```

Rather than use extrinsic metrics, which rely either on a ground truth labeling or on a harmonic mean between two clusterings, intrinsic evaluation metrics do not rely on the existence of a ground truth. One of the most commonly used intrinsic metrics is the optimized value of the modularity function; this is based on the fact that networks with very similar community structure should have very similar modularity values. Within the HMS algorithm, the optimized value of the Hamiltonian can be used in the same manner as the modularity function. Because the optimized Hamiltonian values from the HMS algorithm are not bounded in the same manner as the extrinsic cluster evaluation metrics, a p-norm of the differences between the Hamiltonians for two clusterings is used such that

\[\begin{equation} d\left(\mathcal{H}_{\Omega},\mathcal{H}_{\mathbb{C}}\right)=\Vert \mathcal{H}_{\Omega} - \mathcal{H}_{\mathbb{C}} \Vert_{p} \end{equation}\] for \(1 \leq p < \infty.\)

To create a matrix of distances, the function
`ham_distance`

can be used. This function uses the
Hamiltonian output values from a community detection algorithm that
implements a Hamiltonian value, such as the HMS algorithm. To ensure a
positive, semi-definite matrix (as required for the kernel machine
methods), the absolute difference between Hamiltonian values is
calculated.
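Since each subject contributes a single optimized Hamiltonian, the pairwise absolute differences can be sketched with base R's `outer` (an illustration only, using three example Hamiltonian values):

```r
h <- c(-160.5375, -167.8426, -121.7128)
D_ham <- round(abs(outer(h, h, "-")), 4)  # |H_i - H_j| for all pairs
D_ham
##         [,1]    [,2]    [,3]
## [1,]  0.0000  7.3051 38.8247
## [2,]  7.3051  0.0000 46.1298
## [3,] 38.8247 46.1298  0.0000
```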

```
hamil_df <- data.frame(id = seq(1:8),
                       ham = c(-160.5375, -167.8426, -121.7128,
                               -155.7245, -113.9834, -112.5262,
                               -117.9724, -171.374))
ham_distance(hamil_df)
## 1 2 3 4 5 6 7 8
## [1,] 0.0000 7.3051 38.8247 4.8130 46.5541 48.0113 42.5651 10.8365
## [2,] 7.3051 0.0000 46.1298 12.1181 53.8592 55.3164 49.8702 3.5314
## [3,] 38.8247 46.1298 0.0000 34.0117 7.7294 9.1866 3.7404 49.6612
## [4,] 4.8130 12.1181 34.0117 0.0000 41.7411 43.1983 37.7521 15.6495
## [5,] 46.5541 53.8592 7.7294 41.7411 0.0000 1.4572 3.9890 57.3906
## [6,] 48.0113 55.3164 9.1866 43.1983 1.4572 0.0000 5.4462 58.8478
## [7,] 42.5651 49.8702 3.7404 37.7521 3.9890 5.4462 0.0000 53.4016
## [8,] 10.8365 3.5314 49.6612 15.6495 57.3906 58.8478 53.4016 0.0000
```

A kernel is a function that takes as its inputs vectors in some original space and returns the dot product of the corresponding vectors in a feature space. More formally, if we have \(\mathbf{x},\mathbf{z} \in \mathcal{X}\) and a map \(\phi: \mathcal{X} \rightarrow \mathbb{R}^N\), then \(k\left(\mathbf{x},\mathbf{z}\right)=\langle\phi\left(\mathbf{x}\right),\phi\left(\mathbf{z}\right)\rangle\) is a kernel function. An important concept relating to kernel methods is the reproducing kernel Hilbert space (RKHS). Because all kernels are positive definite, there exist one or more feature spaces for which the kernel defines the inner product, without having to explicitly define such a feature space. By the Moore-Aronszajn theorem, for each kernel \(k\) on a set \(\mathcal{X}\) there is a unique space of functions (a Hilbert space) on \(\mathcal{X}\) for which \(k\) is a reproducing kernel.

Letting \(Z\) be a multidimensional array of variables and \(i,j\) be two subjects, then \(k\left(Z_i,Z_j\right)\) can be used as a measure of similarity between the pair of subjects \(i,j\) since \(k\left(\cdot,\cdot\right)\) is positive definite. This similarity measure can then be incorporated into a statistical inference framework to test to what extent variation in \(Z\) between individuals can explain variation in some outcome of interest \(Y\). A range of kernel functions are used in statistics, where the choice of kernel determines the function space used to approximate the relationship between two variables. A distance-based kernel is denoted as

\[\begin{equation} K_d\left(x_1,x_2\right)=\exp\left\{\dfrac{-d^2(x_1,x_2)}{\rho}\right\}, \end{equation}\] where \(d^2(x_1,x_2)\) is a distance function and \(\rho\) is an unknown bandwidth or scaling parameter.
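As a concrete illustration, a distance-based kernel of this form can be computed directly from any distance matrix. The sketch below is not part of **CommKern**: the function name `dist_to_kernel` and the fixed bandwidth `rho` are hypothetical, since the score test functions in the package treat \(\rho\) as a nuisance parameter handled by a grid search.

```r
# Minimal sketch (not part of CommKern): turn a distance matrix into a
# distance-based (Gaussian) kernel matrix with a fixed, arbitrary bandwidth.
dist_to_kernel <- function(d_mat, rho) {
  exp(-(d_mat^2) / rho)
}

d_mat <- matrix(c(0, 2, 1,
                  2, 0, 3,
                  1, 3, 0),
                nrow = 3, byrow = TRUE)
K <- dist_to_kernel(d_mat, rho = 4)
round(K, 4)
```

Note that the kernel matrix inherits symmetry from the distance matrix and has ones on its diagonal, since \(d(x,x)=0\).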

While this package provides several options for cluster evaluation, many more exist in the literature. End users can utilize their own custom metrics, so long as the resulting distance matrix is symmetric and positive semi-definite, as these are the requirements for incorporation into the kernel machine methods detailed in the next section.
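Before plugging a custom metric into the kernel machine functions, it can be worth verifying these properties. The sketch below is illustrative and not part of **CommKern**: `check_dist_mat` is a hypothetical helper that checks symmetry and non-negativity of the distance matrix and, using an arbitrary bandwidth, positive semi-definiteness of the induced Gaussian kernel up to numerical tolerance.

```r
# Illustrative checks (not part of CommKern) for a user-supplied
# distance matrix before using it with the kernel machine methods.
check_dist_mat <- function(d_mat,
                           rho = median(d_mat[d_mat > 0]),
                           tol = 1e-8) {
  K <- exp(-(d_mat^2) / rho)  # induced distance-based kernel
  c(symmetric    = isSymmetric(unname(d_mat)),
    non_negative = all(d_mat >= 0),
    kernel_psd   = all(eigen(K, symmetric = TRUE,
                             only.values = TRUE)$values >= -tol))
}

set.seed(1)
d_mat <- as.matrix(dist(matrix(rnorm(20), nrow = 5)))  # Euclidean distances
check_dist_mat(d_mat)
```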

Suppose a dataset consists of \(n\) subjects, where for subject \(i=1,\dots, n\), \(y_i\) is an outcome variable (either binary or continuous), \(\mathbf{x}_i\) is a \(q \times 1\) vector of clinical covariates, and \(\mathbf{z}_i\) is a \(p \times 1\) vector of cluster evaluation metrics. The outcome \(y_i\) depends on \(\mathbf{x}_i\) and \(\mathbf{z}_i\) through the following semiparametric linear model

\[\begin{equation} y_i = \mathbf{x}_{i}^{T}\mathbf{\beta} + h\left(\mathbf{z}_i\right) + e_i, \end{equation}\] where \(\mathbf{\beta}\) is a \(q \times 1\) vector of regression coefficients, \(h\left(\mathbf{z}_i\right)\) is an unknown, centered and smooth function, and \(e_i\) is an independent and identically distributed error term following \(N\left(0,\sigma^2\right)\).
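To make the model concrete, the following sketch simulates data from it with one clinical covariate and a hypothetical smooth function \(h\). Every name here is illustrative and none of it is part of **CommKern**.

```r
# Simulate n subjects from y = x' beta + h(z) + e with a hypothetical
# smooth h; purely illustrative, not part of CommKern.
set.seed(42)
n    <- 100
x    <- cbind(1, rnorm(n))           # intercept plus one clinical covariate
beta <- c(0.5, 1.2)                  # regression coefficients
z    <- runif(n)                     # a one-dimensional "metric" input
h    <- function(z) sin(2 * pi * z)  # unknown smooth function
y    <- as.vector(x %*% beta + h(z) + rnorm(n, sd = 0.5))
summary(y)
```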

This model allows covariate effects to be modeled parametrically and the brain connectivity metric to be modeled either parametrically or nonparametrically. There are several special cases of this model: when \(h\left(\cdot\right)=0\), the model reduces to a standard logistic regression model, and when \(\mathbf{x}_i=1\), the model reduces to a least-squares kernel machine regression. A hypothesis test can be conducted to determine whether the multidimensional variable set \(\mathbf{z}_i\) is associated with \(y_i\), controlling for \(\mathbf{x}_i\), of the form

\[\begin{equation} H_0: h\left( \cdot \right) = 0\\ H_A: h\left( \cdot \right) \neq 0\\ \end{equation}\]

Details of the derivation of the score test will be left out of this vignette. If end users are interested, please refer to Jensen et al. (2019); Liu, Lin, and Ghosh (2007); and Liu, Ghosh, and Lin (2008).

Using the datasets provided in the **CommKern** package, we can implement the various kernel machine tests. In all cases, the parameter `grid_gran` is preset to 5000; it controls the length of the grid search over the nuisance \(\rho\) parameter.

If the outcome is binary and no other covariates are of interest, use the `score_log_nonparam` function:

```
str(score_log_nonparam)
## function (outcome, dist_mat, grid_gran = 5000)
```

Using the `simasd_covars` demographics dataset, where `dx_group` will be used as the outcome of interest, we will use `simasd_hamil_df` as the basis for our distance matrix.

```
simasd_ham_mat <- ham_distance(simasd_hamil_df)
score_log_nonparam(outcome = simasd_covars$dx_group,
                   dist_mat = simasd_ham_mat)
## [1] 0.5703748
```

If we wish to incorporate additional covariates of interest, such as age, sex, or handedness, this can easily be done via the `score_log_semiparam` function.

```
str(score_log_semiparam)
## function (outcome, covars, dist_mat, grid_gran = 5000)
```

For our purposes, we will still use the Hamiltonian as the basis for our distance matrix. Because handedness is a categorical variable, we will convert it to a factor before adding the covariates dataframe to the function.

```
simasd_ham_mat <- ham_distance(simasd_hamil_df)
simasd_confound <- simasd_covars[, 3:5]
simasd_confound$handedness <- as.factor(simasd_confound$handedness)
score_log_semiparam(outcome = simasd_covars$dx_group,
                    covars = simasd_confound,
                    dist_mat = simasd_ham_mat)
## [1] 0.5368395
```

In a similar manner, if the outcome of interest is continuous and no other covariates are of interest, use the `score_cont_nonparam` function:

```
str(score_cont_nonparam)
## function (outcome, dist_mat, grid_gran = 5000)
```

Still using the `simasd_covars` demographics dataset, we will now use `verbal_IQ` as the outcome of interest and `simasd_comm_df` as the basis for our distance matrix, with the NMI as the extrinsic cluster evaluation metric.

```
simasd_NMI_mat <- ext_distance(comm_df = simasd_comm_df,
                               variant = c("NMI"))
score_cont_nonparam(outcome = simasd_covars$verbal_IQ,
                    dist_mat = simasd_NMI_mat)
## [1] 0.5619891
```

If we wish to incorporate additional covariates of interest, such as age, sex, or handedness, this can easily be done via the `score_cont_semiparam` function.

```
str(score_cont_semiparam)
## function (outcome, covars, dist_mat, grid_gran = 5000)
```

For our purposes, we will now use purity as the basis for our distance matrix. Because handedness is a categorical variable, we will convert it to a factor before adding the covariates dataframe to the function.

```
simasd_pur_mat <- ext_distance(comm_df = simasd_comm_df,
                               variant = c("purity"))
simasd_confound <- simasd_covars[, 3:5]
simasd_confound$handedness <- as.factor(simasd_confound$handedness)
score_cont_semiparam(outcome = simasd_covars$verbal_IQ,
                     covars = simasd_confound,
                     dist_mat = simasd_pur_mat)
## [1] 0.3368465
```

Jensen, Alexandria M, Jason R Tregellas, Brianne Sutton, Fuyong Xing,
and Debashis Ghosh. 2019. “Kernel Machine Tests of Association
Between Brain Networks and Phenotypes.” *PLoS ONE* 14 (3):
e0199340. https://doi.org/10.1371/journal.pone.0199340.

Liu, Dawei, Debashis Ghosh, and Xihong Lin. 2008. “Estimation and
Testing for the Effect of a Genetic Pathway on a Disease Outcome Using
Logistic Kernel Machine Regression via Logistic Mixed Models.”
*BMC Bioinformatics* 9 (1): 1–11. https://doi.org/10.1186/1471-2105-9-292.

Liu, Dawei, Xihong Lin, and Debashis Ghosh. 2007. “Semiparametric
Regression of Multidimensional Genetic Pathway Data: Least-Squares
Kernel Machines and Linear Mixed Models.” *Biometrics* 63
(4): 1079–88. https://doi.org/10.1111/j.1541-0420.2007.00799.x.