Introduction

In many research settings, detailed spatiotemporal (ST) point data are hard to obtain because of privacy constraints. The R package stppSim addresses this problem by letting users simulate realistic data scenarios, providing an alternative source of spatiotemporal point patterns. The approach combines microsimulation and agent-based techniques to generate a collection of ‘walkers’ (which can represent agents, objects, individuals, etc.). These walkers have defined movement characteristics and interact with their surrounding environment.

The package includes two main functions: (i) psim_artif and (ii) psim_real, both of which play a central role in simulating defined spatiotemporal interactions within point data. The function psim_artif generates these interactions based on user-provided parameters, effectively executing the simulation process without relying on any existing point data. In contrast, the function psim_real generates point interactions using the provided actual sample dataset. This latter function proves particularly valuable in situations where genuine point data is scarce or inadequate for practical applications.

Elements of data simulation

The following section describes three essential components of the simulation: the agents, the spatial factors, and the temporal aspects:

The agents (walkers)

The following properties define the agents:

  • Movement - Agents or walkers can move in any direction and can detect obstacles or restrictions along their paths. Their movements are governed by an internal transition matrix (TM), which defines two operational states: the exploratory state (the walker is exploring the environment) and the performative state (the walker is executing an action). The probabilistic nature of the TM introduces variation in behavioral patterns among the walkers. To switch from one state to the other, a categorical distribution is assigned to a latent state variable \(z_t\), so that each time step may lead to either state, independently of the previous state: \[z_t \sim \text{Categorical}(\Psi_1, \Psi_2)\] where \(\Psi_i = \Pr(z_t = i)\) is the fixed probability of being in state \(i\) at time \(t\), and \(\sum_{i=1}^{2}\Psi_i = 1\).

  • Spatial perception [s_threshold] - The perception range of a walker at a given location is determined by the parameter s_threshold. As the walker changes position, this parameter is updated. A common way to set this parameter is to plot the data and then select an estimate that agrees with prior assumptions about the parameter. For many use cases, this strategy is quite effective. For psim_artif, users need to specify a value. For psim_real, however, the best-suited s_threshold value can be derived from the available sample dataset.

  • Steps [step_length] - The maximum distance a walker travels from one location to the next is the step_length, which essentially characterizes the walker’s speed across an area. It is important to set the step_length carefully, especially when the walker’s movements are confined to narrow pathways such as a route network. In that case, the chosen value should be less than the pathway’s width.

  • Proportional ratios [p_ratio] - This refers to the density of events produced by the walkers in a given space. Specifically, it represents the fraction of total events stemming from a select group of the most active starting points. Take, for instance, a 20:80 ratio: this suggests that 20% of starting points (or walkers) are responsible for generating 80% of all point events. This implies that starting points possess varying intensity values, which can be leveraged to predict the eventual spatial distribution of these events, termed as the spatial model.
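As a rough illustration of two of the properties above, the following base R sketch draws a latent state for each time step from a categorical distribution and builds origin weights that follow a 20:80 ratio. It is a sketch under stated assumptions, not the package's internal code: the probabilities in psi, the equal split of weight within each group of origins, and all variable names are hypothetical.

```r
# Illustrative sketch only; not the internal implementation of stppSim.
set.seed(42)

# (1) Latent state per time step, drawn independently of the previous state.
psi <- c(exploratory = 0.7, performative = 0.3)  # assumed probabilities, sum to 1
n_steps <- 1000
z <- sample(names(psi), size = n_steps, replace = TRUE, prob = psi)
state_share <- prop.table(table(z))  # empirical shares approach psi as n_steps grows

# (2) Origin weights for a 20:80 proportional ratio.
n_origin <- 50
p_ratio <- 20                                   # percent of origins that are most active
n_top <- round(n_origin * p_ratio / 100)        # number of highly active origins
# Assumption: weight is split equally within each group of origins.
w <- c(rep(((100 - p_ratio) / 100) / n_top, n_top),
       rep((p_ratio / 100) / (n_origin - n_top), n_origin - n_top))
top_share <- sum(w[seq_len(n_top)])             # share of events from the top origins
```

Here the first n_top origins jointly carry 80% of the total weight, which is one simple way to read the 20:80 example.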

Spatial factors (landscape)

The following are the key properties of a landscape:

  • Spatial bandwidth [s_band] - The spatial bandwidth is used to identify event re-occurrences that take place between two specific spatial thresholds. For instance, setting a spatial bandwidth of 200m to 400m means the user aims to detect repeated events happening within this distance range. When paired with the temporal bandwidth (discussed further below), this defines a complete spatiotemporal bandwidth. Please note: this applies solely to point pattern simulations created from scratch using the psim_artif function. For simulations based on actual sample datasets, spatial bandwidths are identified automatically.

  • Origins [coords] - Walkers originate from specific starting points, referred to as origins. These origins can be randomly scattered throughout an area or may follow particular spatial patterns. Each origin is characterized by its xy coordinates. For instance, in the context of criminology, an offender might be represented as a walker, with their home serving as the origin.

There are two primary patterns in which origins can be concentrated: nucleated and dispersed, as highlighted by Hornby and Jones (1991). In a nucleated concentration, all origins cluster around a single central point. A dispersed concentration, on the other hand, features multiple focal points, with origins possibly spread randomly throughout the area (see Figure 1 for an illustration).

Figure 1: Type of origin concentration
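The difference between the two concentration types can be sketched in base R as follows. This is only an illustration under stated assumptions: the square study area, the Gaussian scatter around each centre, the spread of 60 units, and all variable names are hypothetical and are not taken from the package.

```r
# Illustrative sketch only; not how stppSim places origins internally.
set.seed(1)
n_origin <- 50
xlim <- c(0, 1000)   # hypothetical study area extent (arbitrary units)
ylim <- c(0, 1000)
spread <- 60         # assumed scatter of origins around a centre

# Nucleated: all origins cluster around a single central point.
centre <- c(500, 500)
nucleated <- data.frame(
  x = rnorm(n_origin, mean = centre[1], sd = spread),
  y = rnorm(n_origin, mean = centre[2], sd = spread)
)

# Dispersed: several focal points, each origin attached to one of them at random.
n_foci <- 5
foci <- data.frame(
  x = runif(n_foci, xlim[1], xlim[2]),
  y = runif(n_foci, ylim[1], ylim[2])
)
focus_id <- sample(n_foci, n_origin, replace = TRUE)
dispersed <- data.frame(
  x = rnorm(n_origin, mean = foci$x[focus_id], sd = spread),
  y = rnorm(n_origin, mean = foci$y[focus_id], sd = spread)
)
```

Calling plot(nucleated) and plot(dispersed) side by side gives a quick visual check of the two layouts.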


  • Boundary [poly] - A landscape has defined boundaries, either represented by a polygon shapefile (known as poly) or determined by the spatial range of the sample point data.

  • Restrictions [restriction_feat] - Features that act as barriers consist of two main components:

  1. Regions outside of the defined boundary (poly), which have a maximum restriction value of 1. This means that walkers are prohibited from moving beyond this boundary.

  2. Features inside the boundary that hinder movement. These can be specific types of land use or physical landforms, like fenced-off areas or hills.

To produce a restriction map, one typically follows a two-step process. For instance, when using a boundary shapefile of the Camden area in London (UK), a restriction map can be constructed in the following manner:

Step 1: Generate boundary restriction

# load the shapefile data
load(file = system.file("extdata", "camden.rda", package = "stppSim"))
# extract the boundary shapefile
boundary <- camden$boundary
# compute the restriction map
restrct_map <- space_restriction(shp = boundary, res = 20, binary = TRUE)
# plot the restriction map
plot(restrct_map)

Step 2: Set the restrct_map above as the base map, then stack the land use features on it to define the restrictions within the area:

# get landuse data
landuse = camden$landuse 

#compute the restriction map
full_restrct_map <- space_restriction(shp = landuse, 
     baseMap = restrct_map, res = 20, field = "restrVal", background = 1)

#plot the restriction map
plot(full_restrct_map)
Figure 2: Restriction map


Figure 2 provides a graphical representation of both the boundary extent and the restrictions posed by the within-features. These within-features are categorized into three separate classes, each having a unique restriction value as enumerated below:

  • Leisure: 0.5
  • Sports: 0.7
  • Green: 0.9

These values indicate the relative restriction each land use type imposes on movement.

Within the simulation function, the boundary and the within-features are inputted using the poly and restriction_feat parameters, respectively. Both are provided in the .shp (shapefile) format.

  • Focal points [n_foci] - Locations, or origins, of greater significance often present more opportunities for events to occur. This parameter is specified only when using psim_artif, where users choose the number of focal points they wish to simulate. In terms of urban landscape structure, a focal point can be thought of as a city or town centre.

Additionally, if there’s a principal focal point within a city, it can be denoted using the mfocal parameter. By default, the value for mfocal is set to NULL.

There is also a foci_separation parameter that lets users define how close together or far apart these focal points are. This parameter accepts values ranging from 1 to 100. A value of 1 signifies the closest proximity, whereas 100 indicates the greatest distance between focal points.

Temporal dimension

The following parameters define the temporal dimension:

  • Temporal bandwidth [t_band] - The temporal bandwidth is used to identify event re-occurrences that take place between two specific temporal thresholds. For instance, setting a temporal bandwidth of 2 to 4 days means the user aims to detect repeated events happening within this time range. When paired with the spatial bandwidth (discussed above), this defines a complete spatiotemporal bandwidth. As with the spatial bandwidth, this applies solely to point pattern simulations created from scratch using the psim_artif function. For simulations based on actual sample datasets, temporal bandwidths are identified automatically.

  • Long-term trend [trend] - This parameter establishes the overarching trend of the time series that is to be simulated. The trend can be categorized as stable, rising, or falling.

  • Stable: Indicates that the time series remains relatively constant over time, with no significant upward or downward trend.

  • Rising: Suggests an upward trend in the time series. When this is selected, the supplementary slope argument can be employed to further define the incline of the trend as either gentle (a moderate increase) or steep (a rapid increase).

  • Falling: Denotes a downward trend in the time series. Similar to the rising trend, when this is chosen, the slope argument can be used to distinguish between a gentle decline or a steep drop.

This parameter is relevant only when simulating a time series from scratch, without any pre-existing data.

  • Seasonal peak [fPeak] - This parameter sets the initial temporal peak of a sinusoidal pattern in a time series, thereby dictating the medium-term undulations throughout the series’ duration. For instance, a first peak set at 90 days denotes a seasonal cycle spanning 180 days in the time series. This approach is primarily employed when the simulation’s objective isn’t to produce spatiotemporal interactions but to capture more general cyclic patterns within the data.
Figure 3: Global trends and patterns


Figure 3 depicts anticipated seasonal patterns determined by various fPeak values. Beginning at 90 days, each subsequent pattern sees the fPeak value augmented by one month. As the fPeak date is pushed forward, the number of full seasonal cycles reduces.

The integration of the long-term trend with the seasonal peak shapes the temporal model for the simulation. Before launching the actual simulation, it is advisable to either preview or review this model to ensure accuracy and alignment with objectives.
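To make the combination of trend and seasonal peak more concrete, the sketch below builds a simple temporal model in base R. It follows the statement above that a first peak at 90 days corresponds to a 180-day cycle. Everything else is an assumption made for illustration: the linear form and slope of the trend, the seasonal amplitude, the multinomial allocation of events to days, and the variable names. It is not the package's internal temporal model.

```r
# Illustrative sketch only; not the internal temporal model of stppSim.
days <- 1:365
first_peak <- 90               # assumed first seasonal peak, in days
period <- 2 * first_peak       # 180-day seasonal cycle, as described in the text

# Sinusoid whose first maximum falls on day `first_peak`.
seasonal <- -cos(2 * pi * days / period)

# Hypothetical gentle rising trend (the slope value is an assumption).
trend <- 1 + 0.001 * days

# Combine trend and seasonality into a positive daily intensity,
# then normalise to daily probabilities.
intensity <- trend * (1 + 0.5 * seasonal)
p_day <- intensity / sum(intensity)

# Allocate 1000 events to days according to the temporal model.
set.seed(123)
events_per_day <- as.vector(rmultinom(1, size = 1000, prob = p_day))
```

Plotting intensity against days, for example with plot(days, intensity, type = "l"), gives a preview comparable in spirit to the smoothed temporal model described above.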

  • Time bin - The interval after which all walkers are reset, typically 1 day.
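The spatial and temporal bandwidths described above jointly define which pairs of events count as re-occurrences. The base R sketch below counts such pairs in a toy dataset. The data are randomly generated, the column names are hypothetical, and treating both thresholds as inclusive is an assumption rather than documented package behaviour.

```r
# Illustrative sketch only; toy data, not output of stppSim.
set.seed(7)
n <- 200
ev <- data.frame(
  x   = runif(n, 0, 2000),                 # hypothetical coordinates in metres
  y   = runif(n, 0, 2000),
  day = sample(0:30, n, replace = TRUE)    # hypothetical event day
)

s_band <- c(200, 400)   # spatial bandwidth: 200m to 400m
t_band <- c(2, 4)       # temporal bandwidth: 2 to 4 days

# Pairwise spatial and temporal distances between events.
d_space <- as.matrix(dist(ev[, c("x", "y")]))
d_time  <- as.matrix(dist(ev$day))

# A pair falls in the spatiotemporal bandwidth when both distances
# lie between the respective thresholds (inclusive, by assumption).
in_band <- d_space >= s_band[1] & d_space <= s_band[2] &
           d_time  >= t_band[1] & d_time  <= t_band[2]

# Count each unordered pair once.
n_pairs <- sum(in_band[upper.tri(in_band)])
```

Comparing n_pairs for different bandwidths is one simple way to see how strongly events cluster at a given spatiotemporal scale.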

Installation of stppSim

From the R console, type:

# To install from CRAN:
install.packages("stppSim")

# To install the development version:
remotes::install_github("MAnalytics/stppSim")
# Note: the `remotes` package must be installed before running this line.

To load the package:

library(stppSim)

Notice:

interactive argument

Both psim_artif and psim_real functions include the interactive argument, which is set to FALSE as the default setting. When the interactive argument is toggled to TRUE, the console displays queries during the function’s execution, prompting the user to decide if they wish to view the spatial and temporal models of the simulation.

The spatial model displays the origins’ locations and their strength distribution across the simulated space. This strength distribution indicates how the simulated points (events) are likely to be distributed.

On the other hand, the temporal model offers a visual representation of the expected trend and seasonal pattern, presented in a smoothed manner.

Thus, by using the interactive option, users are given the advantage of reviewing both spatial and temporal patterns, ensuring that they align with their expectations and objectives before moving forward with the complete simulation.

Simulating point patterns from scratch

Three essential arguments are necessary for the simulation:

  1. n_events - This refers to the number of points to simulate. Instead of providing just a single value, it’s recommended to input a vector of values. For instance, n_events = c(200, 500, 1000, 2000). The output is presented as a list, with each value corresponding to a separate data frame. Notably, the length of n_events has minimal to no impact on processing duration.

  2. start_date - This designates the commencement date of the time series.

  3. poly - This represents the polygon shapefile that demarcates the boundary of the study area. The simulated point patterns are restricted to occur within this designated boundary.

By providing these arguments, users can customize the scope and specifics of their simulation to meet their research objectives.

Example

To generate a spatiotemporal point pattern (stpp) using the boundary shapefile of the Camden Borough of London, which is embedded in the package, use the following code:


# load the data
load(file = system.file("extdata", "camden.rda",
                        package = "stppSim"))

boundary <- camden$boundary # get boundary data

# specify the data sizes
pt_sizes <- c(200, 1000, 2000)

# simulate data
artif_stpp <- psim_artif(n_events = pt_sizes, start_date = "2021-01-01",
  poly = boundary, n_origin = 50, restriction_feat = NULL,
  field = NA,
  n_foci = 5, foci_separation = 10, mfocal = NULL,
  conc_type = "dispersed",
  p_ratio = 20, s_threshold = 50, step_length = 20,
  trend = "stable", fpeak = NULL,
  slope = NULL, show.plot = FALSE, show.data = FALSE)

The processing time on an Intel Core i7-7500CPU @ 2.70GHz, 16.0GB RAM PC is 12.5 minutes. The processing time increases to 45.2 minutes if landscape restriction is added. Specifically, this increase occurs when the argument restriction_feat = camden$landuse is used together with field = "restrVal", the field name used in the restriction map example above.

To retrieve the result for any value of n_events, index the output object by the position of that value. For example, to retrieve the result based on n_events = 1000, type:

stpp_1000 <- artif_stpp[[2]]
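
Because the output list follows the order of n_events, the index can also be looked up by value rather than typed by hand. The sketch below shows the idea on a stand-in list; mock_stpp and its single id column are hypothetical placeholders and do not reflect the structure of the real psim_artif output.

```r
# Illustrative sketch only; `mock_stpp` stands in for the output of psim_artif.
pt_sizes <- c(200, 1000, 2000)

# One placeholder data frame per requested size, in the same order as pt_sizes.
mock_stpp <- lapply(pt_sizes, function(n) data.frame(id = seq_len(n)))

# Look up the position of the requested size instead of hard-coding it.
idx <- match(1000, pt_sizes)
stpp_1000 <- mock_stpp[[idx]]
```

With the real output, the same lookup would read artif_stpp[[match(1000, pt_sizes)]], assuming the list order matches pt_sizes as described above.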
  • Spatial Patterns

The configuration and clustering of events in the spatial domain can be fine-tuned by adjusting the parameters that determine spatial components (such as restriction_feat, n_origin, mfocal, foci_separation, n_foci, and s_band) as well as those that guide walker behavior (for example, step_length, s_threshold, and p_ratio). To introduce a main focal point in the simulation (see the mfocal argument in the package manual), use the make_grids function. This function produces an interactive map that displays, and allows extraction of, the xy coordinates of any location on the map. The map is backed by an integrated OpenStreetMap layer, which makes it easier to pinpoint specific locations.

Figure 4 showcases the spatial point patterns (spp) for n_events = 1000 under diverse parameter settings. Note: The spatial configuration may differ with each code execution due to inherent random aspects within the function.

Figure 4a displays the outcome when relying solely on default arguments, as demonstrated in the previous code.

Figure 4b presents the pattern resulting from the integration of additional parameters: restriction_feat = camden$landuse and mfocal = c(530000, 182250). Here, the first parameter restricts the number of events created within the land use (restriction) features, while the second emphasizes a central spatial concentration of origins, highlighted by a red dot on the map.

Figure 4c depicts the configuration when the parameters of restriction_feat and mfocal are retained (as in 4b), but with an added foci_separation = 50. This ensures a moderate spatial distance between individual origins.

Lastly, Figure 4d illustrates the spatial pattern when, besides maintaining the mfocal setting (similar to the above figures), the s_threshold and step_length are set at 250 and 50 respectively. This configuration aims to promote a broader distribution of points relative to their origins.