exactt

Introduction

The exactt package tests whether a slope coefficient is equal to some null value using the novel method described in Pouliot (2023). Importantly, inverting such a test produces a marginally valid confidence interval.

Installation

The exactt package is hosted on GitHub at https://github.com/ian-xu-economics/exactt/. It can be installed using the remotes::install_github() function:

# install.packages("remotes")
remotes::install_github("ian-xu-economics/exactt")

Attribution

To cite the exactt package in publications, use the citation() function, which provides both the text version and the BibTeX entry for referencing:

citation("exactt")

Using `exactt`

After installing exactt, we can attach the package to our session using the base library() function:

library("exactt")

Example Usage: Regular Case

To compute the $(1-\alpha)$ confidence interval, use the exactt() function. Here’s an example that examines the effect of vitamin C on tooth growth in guinea pigs, using data from datasets::ToothGrowth. We’ll investigate how supp (supplement type: orange juice (OJ) or ascorbic acid (VC)) and dose (in milligrams/day) relate to len (tooth length).

summary(datasets::ToothGrowth)
#>       len        supp         dose      
#>  Min.   : 4.20   OJ:30   Min.   :0.500  
#>  1st Qu.:13.07   VC:30   1st Qu.:0.500  
#>  Median :19.25           Median :1.000  
#>  Mean   :18.81           Mean   :1.167  
#>  3rd Qu.:25.27           3rd Qu.:2.000  
#>  Max.   :33.90           Max.   :2.000

Suppose our model is $len_i = \beta_0 + \beta_{dose} \times dose_i + \beta_{supp} \times supp_i + \varepsilon_i$. We can create a 90% confidence interval by passing the model to exactt() in standard R formula notation. The level of significance (alpha) is set to 0.1 here. For any parameter we do not specify, the defaults are:

The number of blocks used equals 5 (nBlocks = 5).
The confidence interval is constructed for all variables (variables = NULL).
The number of permutations equals the factorial of the number of blocks (nPerms = factorial(nBlocks)), which is 120 when nBlocks = 5.
The level of significance equals 0.05 (alpha = 0.05).
The test statistics are studentized (studentize = TRUE).
The ordering of the data is not permuted (permutation = NULL).
The ordering of the data is not optimized (optimize = FALSE).

exactt.1 <- exactt(model = len ~ dose + supp,
                   data = datasets::ToothGrowth,
                   alpha = 0.1)

print(exactt.1, digits = 5)
#> 
#> Exact t-Test (Marginally Valid Tests)
#> 
#> Call:
#> exactt(model = len ~ dose + supp, data = datasets::ToothGrowth, 
#>     alpha = 0.1)
#> 
#> Summary:
#>         Estimate  P-value  Lower Bound  Upper Bound
#> dose     9.76357  0.07500      3.31768     16.01241
#> suppVC  -3.70000  0.26667    -10.98448      7.60653
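
The P-values in this output are consistent with the default of factorial(5) = 120 permutations: each one is, up to rounding, a whole number divided by 120. The base R sketch below checks this. Reading a P-value as a share of permutations is an assumption inferred from the printed numbers, not a documented property of exactt(), and the values are copied by hand from the summary above.

```r
# Default number of permutations: factorial(nBlocks) with nBlocks = 5.
n_perms <- factorial(5)

# P-values copied by hand from the printed summary above.
p_values <- c(dose = 0.07500, suppVC = 0.26667)

# Implied permutation counts (assumption: P-value = count / n_perms).
implied_counts <- round(p_values * n_perms)
implied_counts

# Largest gap between the printed P-values and count / n_perms.
max(abs(p_values - implied_counts / n_perms))

# Smallest P-value attainable under this reading.
1 / n_perms
```

Under this reading, no P-value can fall below 1/120 (about 0.00833) with the default settings, so a finer resolution would require more blocks.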

Focusing on Specific Variables

To focus on specific coefficients, set the variables parameter. The number entered corresponds to the index of the regressors in the model (note that the intercept is never counted). For example, set variables = 1 for dose, and set variables = 2 for supp.

exactt.2 <- exactt(model = len ~ dose + supp,
                   data = datasets::ToothGrowth,
                   alpha = 0.1,
                   variables = 1)

print(exactt.2, digits = 5)
#> 
#> Exact t-Test (Marginally Valid Tests)
#> 
#> Call:
#> exactt(model = len ~ dose + supp, data = datasets::ToothGrowth, 
#>     alpha = 0.1, variables = 1)
#> 
#> Summary:
#>       Estimate  P-value  Lower Bound  Upper Bound
#> dose   9.76357    0.075      3.31768     16.01241

This creates a 90% confidence interval for dose only. The interval is identical to the one obtained with variables = NULL (all variables are of interest) because these confidence intervals are marginally valid: each one is constructed for its own coefficient.
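
To see which index belongs to which regressor, one option is to list the columns of the design matrix with the intercept removed. This is a minimal sketch in base R. It assumes that exactt() counts regressors in design-matrix column order, which matches the dose and suppVC rows printed above. How a factor with more than two levels is indexed is not stated here, so check the printed row names in that case.

```r
# Design matrix for the model used above.
X <- model.matrix(len ~ dose + supp, data = datasets::ToothGrowth)

# Drop the intercept; the remaining columns are the regressors.
regressors <- setdiff(colnames(X), "(Intercept)")

# Position in this vector is the presumed value of `variables`.
data.frame(variables = seq_along(regressors), regressor = regressors)
```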

Model Flexibility

The exactt() function is designed to allow for easy modification of your model. For instance, you can treat a variable as categorical, include polynomial terms, or apply other transformations directly within the model formula. This flexibility helps tailor the analysis to specific research questions without needing pre-transformed data. To illustrate, consider treating dose as a categorical variable to explore its discrete impact on tooth length:

exactt.3 <- exactt(model = len ~ as.factor(dose) + supp,
                   data = datasets::ToothGrowth,
                   alpha = 0.1)

exactt.3
#> 
#> Exact t-Test (Marginally Valid Tests)
#> 
#> Call:
#> exactt(model = len ~ as.factor(dose) + supp, data = datasets::ToothGrowth, 
#>     alpha = 0.1)
#> 
#> Summary:
#>                   Estimate  P-value  Lower Bound  Upper Bound
#> as.factor(dose)1     9.130  0.07500         -Inf          Inf
#> as.factor(dose)2    15.495  0.21667          -64     29.71429
#> suppVC              -3.700  0.95000         -Inf          Inf

The 90% confidence intervals for as.factor(dose)1 and suppVC are unbounded, and the interval for as.factor(dose)2 is very wide, so none of them is informative. This is due to suboptimal data ordering, which can diminish the statistical power of the test. The issue can be addressed by optimizing the data ordering.
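
The row labels as.factor(dose)1 and as.factor(dose)2 follow R's default treatment coding: dose takes the values 0.5, 1, and 2, the lowest level (0.5) is the omitted baseline, and the two rows are the dose = 1 and dose = 2 contrasts against it. The sketch below shows this with base R. lm() is used only to illustrate the coding and is not part of exactt; its point estimates agree with the Estimate column above.

```r
tg <- datasets::ToothGrowth

# Factor levels of dose; the first level is the omitted baseline.
levels(as.factor(tg$dose))

# Ordinary least squares fit of the same formula (for comparison only).
ols <- lm(len ~ as.factor(dose) + supp, data = tg)

# Coefficient names and values, rounded as in the printed summary.
round(coef(ols), 3)
```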

Optimizing Data Ordering

The confidence intervals produced by the exactt() function can change with the ordering of the data. Certain data orderings can enhance statistical power, particularly when the sample size is small and the number of blocks is large. The impact of optimization is even more pronounced when dealing with categorical variables, where appropriate ordering can substantially increase the test’s power.

The exactt() function utilizes a genetic algorithm (provided by the GA::ga() function) to optimize data ordering. This approach systematically explores various data arrangements to find the one that maximizes statistical power on average.

Enabling Optimization

To activate the optimization feature, set optimize = TRUE. Additionally, exactt() allows for the specification of various parameters of the GA::ga() function to tailor the optimization process. For instance, you can limit the number of iterations with maxiter or specify the seed with seed for reproducibility:

exactt.4 <- exactt(model = len ~ as.factor(dose) + supp,
                   data = datasets::ToothGrowth,
                   alpha = 0.1,
                   optimize = TRUE,
                   parallel = FALSE,
                   maxiter = 5,
                   seed = 2024)
#> ✔ Optimizing ordering for `as.factor(dose)1`.
#> ✔ Optimizing ordering for `as.factor(dose)2`.
#> ✔ Optimizing ordering for `suppVC`.

print(exactt.4, digits = 5)
#> 
#> Exact t-Test (Marginally Valid Tests)
#> 
#> Call:
#> exactt(model = len ~ as.factor(dose) + supp, data = datasets::ToothGrowth, 
#>     alpha = 0.1, optimize = TRUE, seed = 2024, parallel = FALSE, 
#>     maxiter = 5)
#> 
#> Summary:
#>                   Estimate  P-value  Lower Bound  Upper Bound
#> as.factor(dose)1     9.130  0.00833      4.44096     12.06245
#> as.factor(dose)2    15.495  0.00833     14.30794     16.97312
#> suppVC              -3.700  0.01667     -6.61671     -3.59551

Note that with the data ordering optimized, exactt() now constructs informative 90% confidence intervals for all three coefficients: as.factor(dose)1, as.factor(dose)2, and suppVC. The detailed results of the optimization, including the genetic algorithm’s configuration and outcomes for each variable, are stored in exactt.4$gaResults. For instance, to review a summary of the genetic algorithm’s performance for the suppVC variable, use:

exactt.4$gaResults$suppVC@summary
#>           max     mean       q3   median       q1      min
#> [1,] 6.068181 4.164565 5.062314 4.052930 3.469694 2.298845
#> [2,] 7.411765 4.346684 5.408912 4.393609 3.322604 2.151261
#> [3,] 7.411765 4.343094 5.290857 4.175537 3.463937 1.098294
#> [4,] 7.411765 4.590258 5.369748 4.549311 3.632077 1.740592
#> [5,] 7.411765 4.449291 5.265123 4.493028 3.824316 1.774685
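
Each row of this summary corresponds to one iteration of the genetic algorithm, and the max column holds the best fitness in the population at that iteration. The sketch below re-enters two columns by hand so that it runs without the package; with a fitted object, the same columns could be taken from exactt.4$gaResults$suppVC@summary instead.

```r
# Columns copied by hand from the printed summary above.
ga_summary <- cbind(
  max  = c(6.068181, 7.411765, 7.411765, 7.411765, 7.411765),
  mean = c(4.164565, 4.346684, 4.343094, 4.590258, 4.449291)
)

# First iteration at which the best fitness reached its final value.
first_best <- which(ga_summary[, "max"] == max(ga_summary[, "max"]))[1]
first_best

# Change in best fitness from one iteration to the next.
diff(ga_summary[, "max"])
```

In this run the best fitness is reached at the second iteration and does not improve afterwards, so the remaining iterations up to maxiter = 5 brought no further gain for suppVC.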

Note on Optimization Effects

Optimization increases power on average. It does not guarantee a narrower confidence interval in every instance.

Example Usage: IV Case

The exactt() function can also handle models with instrumental variables (IV). In Example 15.5 of Wooldridge (2020), Wooldridge reanalyzes Mroz (1987). This example explores the impact of education (educ) on log(wage), using parental education levels, namely mother’s education (motheduc) and father’s education (fatheduc), as instruments. The model controls for experience (exper) and its square (expersq). In the formula, the regressors appear to the left of | and the instruments, together with the exogenous controls, to the right. Education is the primary variable of interest, so we set variables = 1. Optionally, as before, we can optimize the data ordering to enhance statistical power.

exactt.iv <- exactt(model = lwage ~ educ + exper + expersq | exper + expersq + motheduc + fatheduc,
                    data = wooldridge::mroz,
                    variables = 1,
                    optimize = TRUE,
                    parallel = FALSE,
                    maxiter = 10,
                    monitor = TRUE,
                    seed = 31740)
#> ✔ Optimizing ordering for `educ`.
#> GA | iter = 1 | Mean = 2653027 | Best = 3095490
#> GA | iter = 2 | Mean = 2700054 | Best = 3095490
#> GA | iter = 3 | Mean = 2734650 | Best = 3095490
#> GA | iter = 4 | Mean = 2729035 | Best = 3095490
#> GA | iter = 5 | Mean = 2745544 | Best = 3167506
#> GA | iter = 6 | Mean = 2768499 | Best = 3167506
#> GA | iter = 7 | Mean = 2704935 | Best = 3167506
#> GA | iter = 8 | Mean = 2722396 | Best = 3167506
#> GA | iter = 9 | Mean = 2742183 | Best = 3167506
#> GA | iter = 10 | Mean = 2693768 | Best = 3167506

exactt.iv
#> 
#> Exact t-Test (Marginally Valid Tests)
#> 
#> Call:
#> exactt(model = lwage ~ educ + exper + expersq | exper + expersq + 
#>     motheduc + fatheduc, data = wooldridge::mroz, variables = 1, 
#>     optimize = TRUE, seed = 31740, parallel = FALSE, maxiter = 10, 
#>     monitor = TRUE)
#> 
#> Summary:
#>       Estimate  P-value  Lower Bound  Upper Bound
#> educ    0.0614  0.18333     -0.04061      0.13679
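
The Estimate for educ can be cross-checked against a textbook two-stage least squares fit done by hand. This is a sketch under two assumptions: the wooldridge package is installed, and exactt() reads the formula in the usual IV convention, with instruments and exogenous controls to the right of |. The standard errors of a manual second stage are not valid, so only the point estimate is compared.

```r
# Keep the women with an observed wage (lwage is missing otherwise).
mroz <- subset(wooldridge::mroz, !is.na(lwage))

# First stage: project the endogenous regressor on instruments and controls.
first_stage <- lm(educ ~ exper + expersq + motheduc + fatheduc, data = mroz)
mroz$educ_hat <- fitted(first_stage)

# Second stage: replace educ with its first-stage fitted values.
second_stage <- lm(lwage ~ educ_hat + exper + expersq, data = mroz)

# Point estimate for educ; rounds to 0.0614, as in the summary above.
round(coef(second_stage)[["educ_hat"]], 4)
```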