--- title: "Generating missing values" output: rmarkdown::html_vignette bibliography: bibliography.bibtex vignette: > %\VignetteIndexEntry{Generating-missing-values} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4, fig.align = "center" ) ``` To generate missing values in a dataset with missMethods you can use one of the `delete_` functions. The names of these functions always starts with `delete_` and the next part of the name shows the used missing data mechanism. There are three basic types of missing data mechanisms: missing completely at random (`MCAR`), missing at random (`MAR`) and missing not at random (`MNAR`). A list of all available functions for the different mechanisms is given below: **MCAR** * `delete_MCAR()` **MAR** * `delete_MAR_1_to_x()` * `delete_MAR_censoring()` * `delete_MAR_one_group()` * `delete_MAR_rank()` **MNAR** * `delete_MNAR_1_to_x()` * `delete_MNAR_censoring()` * `delete_MNAR_one_group()` * `delete_MNAR_rank()` All these functions share a common interface. The first argument `ds` takes the dataset in which missing values should be generated. The next argument `p` specifies the proportion of missing values to include in every column with missing value. These columns are specified with the third argument `cols_mis`. The further arguments depend on the chosen function and are documented for every function separately. In most cases, reasonable defaults are set for these further arguments. Only the `MAR` functions need one additional argument with no default: `cols_ctrl`. The argument `cols_ctrl` specifies the columns that control the generation of missing data in a MAR settings. One further remark: All `MAR` functions have a `MNAR` twin. These twins behave exactly the same way. The only difference is the columns that controls the generation of missing values. In the `MAR` functions separate `cols_ctrl` columns controls the generation of missing values in the `cols_mis` columns. In contrast, in the `MNAR` functions the generation of missing values in the `cols_mis` columns is controlled by the `cols_mis` columns themselves. In the following, different examples for the generation of missing values will be presented. Furthermore, connections between the functions from missMethods and the paper by @Santos.2019 will be shown. ## Examples for the generation of missing values The examples below show the use of some `delete_` functions in a 2-dimensional dataset. Missing values are always generated in the variable "X" and 30 % of the values are deleted. At first, a basic set-up: ```{r setup} library(missMethods) library(ggplot2) set.seed(123) make_simple_MDplot <- function(ds_comp, ds_mis) { ds_comp$missX <- is.na(ds_mis$X) ggplot(ds_comp, aes(x = X, y = Y, col = missX)) + geom_point() } # generate complete data frame ds_comp <- data.frame(X = rnorm(100), Y = rnorm(100)) ``` ### MCAR Generate MCAR values: ```{r MCAR} ds_mcar <- delete_MCAR(ds_comp, 0.3, "X") make_simple_MDplot(ds_comp, ds_mcar) ``` ### MAR Generate MAR values using a censoring mechanism. This leads to a missing value in "X", if the y-value is below the 30 % quantile of "Y": ```{r MAR censoring} ds_mar <- delete_MAR_censoring(ds_comp, 0.3, "X", cols_ctrl = "Y") make_simple_MDplot(ds_comp, ds_mar) ``` The censoring mechanism is a rather strong form of MAR. A function that allows to control the strength of the MAR mechanism is `delete_MAR_1_to_x`. The strength is controlled through the argument `x`: the bigger `x`, the stronger the simulated `MAR` mechanism: ```{r MAR_1_to_2} # x = 2 ds_mar <- delete_MAR_1_to_x(ds_comp, 0.3, "X", cols_ctrl = "Y", x = 2) make_simple_MDplot(ds_comp, ds_mar) ``` ```{r MAR_1_to_10} # x = 10 ds_mar <- delete_MAR_1_to_x(ds_comp, 0.3, "X", cols_ctrl = "Y", x = 10) make_simple_MDplot(ds_comp, ds_mar) ``` ### MNAR Generate MAR values using a censoring mechanism. This leads to a missing value in "X", if the x-value is below the 30 % quantile of "X": ```{r MNAR censoring} ds_mnar <- delete_MNAR_censoring(ds_comp, 0.3, "X") make_simple_MDplot(ds_comp, ds_mnar) ``` ## Connections between missMethods and @Santos.2019 The following table shows the connections between the algorithm names of the missing data creation methods in @Santos.2019 and the functions of missMethods: @Santos.2019 | Function | Arguments ---------------|-------------------------|---------------------- MCAR1univa |`delete_MCAR` | n_mis_stochastic = FALSE MCAR2univa |`delete_MCAR` | all default MCAR3univa |`delete_MCAR` | all default MAR1univa |`delete_MAR_censoring` | sorting = FALSE MAR2univa |`delete_MAR_rank` | all default MAR3univa |`delete_MAR_1_to_x` | x = 1/9 MAR4univa |`delete_MAR_censoring` | where = "upper" MAR5univa |`delete_MAR_censoring` | where = "both" MNAR1univa |`delete_MNAR_censoring` | sorting = FALSE MNAR2univa |`delete_MNAR_censoring` | where = "upper" MCAR1unifo |`delete_MCAR` | n_mis_stochastic = FALSE MCAR2unifo |`delete_MCAR` | p_overall = TRUE MAR1unifo |`delete_MAR_censoring` | all default MAR2unifo |`delete_MAR_censoring` | sorting = FALSE MAR3unifo |`delete_MAR_one_group` | all default MAR4unifo |`delete_MAR_one_group` | all default MNAR1unifo |`delete_MNAR_censoring` | all default MNAR2unifo |`delete_MNAR_censoring` | sorting = FALSE MNAR3unifo |`delete_MAR_one_group` | all default MNAR4unifo |`delete_MAR_one_group` | all default Only the argument(s), which default values must be altered, are shown in the table. Notice that most functions in missMethods are more general than the described algorithms in @Santos.2019. Therefore, some functions of missMethods are able to replace different algorithms from @Santos.2019. In contrast to @Santos.2019, the user must always specify the missing column(s). However, this may change in the future. ## References