Association measures for collocation and collostruction analyses

Calculates common association measures used to perform collocation or collostruction analysis for typical count data.

Usage

coll_analysis(.x, ...)

# S3 method for data.frame
coll_analysis(
  .x,
  o11 = NULL,
  f1 = NULL,
  f2 = NULL,
  n = NULL,
  fun = "ll",
  flip = NULL,
  ...
)

# S3 method for matrix
coll_analysis(.x, f2 = NULL, n = NULL, fun = "ll", flip = NULL, ...)

# S3 method for default
coll_analysis(.x, o11, f1, f2 = NULL, n = NULL, fun = "ll", flip = NULL, ...)

Arguments

.x: data.frame, matrix or list containing data
...: further arguments to be passed to or from other methods
o11: numeric: joint frequencies
f1: numeric: corpus frequencies of the word
f2: numeric of length 1 or of the same length as o11: corpus frequencies of the co-occurring structure; if omitted, the sum of o11 is used
n: numeric of length 1 or of the same length as o11: corpus or sample size; if omitted, sum(f1 + f2) is used; this may be undesirable for collostruction analysis, where the corpus size should always be passed explicitly
fun: character vector or named list containing character, function or expression elements: the association measures to calculate; character elements name built-in measures (see Details)
flip: character: names of measures for which to flip the sign for cases with negative association, intended for two-sided measures
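
The effect of flip can be pictured with a small base R sketch. It assumes that "negative association" means the observed joint frequency falls below the expected one (o11 < e11); the helper flip_sign() is hypothetical and not part of the package.

```r
# Hypothetical helper, not part of the documented API: negate a two-sided
# score wherever the observed frequency is below the expected frequency.
flip_sign <- function(score, o11, e11) {
  ifelse(o11 < e11, -score, score)
}

# Illustrative values only
o11   <- c(30, 1)       # observed joint frequencies
e11   <- c(1, 4)        # expected joint frequencies
score <- c(120.5, 2.3)  # unsigned scores of a two-sided measure

flipped <- flip_sign(score, o11, e11)
flipped
```

Only the second pair has o11 < e11, so only its score changes sign.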

Value

an object similar to .x with one column per association measure specified in fun; row names in matrices and character or factor columns in data.frames are preserved

Details

For collocation analysis, f1 and f2 typically represent the corpus frequencies of the word and the collocate, respectively, including the frequencies of co-occurrence. For collostruction analysis, f1 represents the corpus frequency of the word and f2 the frequency of the construction. In a contingency table, f1 and f2 are the marginal sums. Both the construction frequency f2 and the corpus size n can be provided as vectors, which allows for efficient calculations over data from multiple constructions or corpora.
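
The relationship between o11, f1, f2 and n can be sketched in base R. The counts below are invented for illustration, and the log-likelihood ratio shown is the standard G² formulation; whether the built-in "ll" measure uses exactly this formula is an assumption.

```r
# Hypothetical counts for two word/construction pairs (illustrative only)
o11 <- c(30, 5)           # joint frequencies
f1  <- c(200, 400)        # corpus frequencies of the words
f2  <- c(500, 1000)       # construction frequencies, one per row
n   <- c(100000, 200000)  # corpus sizes, one per row

# Expected joint frequency under independence, computed element-wise
e11 <- f1 * f2 / n

# Observed 2x2 table for the first pair, rebuilt from the marginal sums
obs <- matrix(
  c(o11[1],         f1[1] - o11[1],
    f2[1] - o11[1], n[1] - f1[1] - f2[1] + o11[1]),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("word", "other"), c("construction", "other"))
)

# Expected table from the row and column sums
expected <- outer(rowSums(obs), colSums(obs)) / n[1]

# Standard log-likelihood ratio (G squared) for the first pair
g2 <- 2 * sum(obs * log(obs / expected))
```

The row sums of obs reproduce f1 and n - f1, and the column sums reproduce f2 and n - f2, which is what "marginal sums" refers to.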

For data.frame input, the values for "o11", "f1", "f2" and "n" can be provided either explicitly, as an expression or a character string, or implicitly by column name. It is recommended to pass the columns explicitly.

Matrix input currently requires the column names "o11", "f1", "f2" and "n".
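
A matrix with the required column names could be built as in the following sketch. The counts and row names are invented, and the call to coll_analysis() is guarded so that the snippet also runs when the package is not attached.

```r
# Illustrative counts only; the row names stand in for the words
m <- cbind(
  o11 = c(30, 5),
  f1  = c(200, 400),
  f2  = c(500, 500),
  n   = c(100000, 100000)
)
rownames(m) <- c("word_a", "word_b")

# Check that all required column names are present
stopifnot(all(c("o11", "f1", "f2", "n") %in% colnames(m)))

# Only call coll_analysis() if the package providing it is attached
if (exists("coll_analysis")) {
  coll_analysis(m, fun = "ll")
}
```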

Examples


data(adjective_cooccurrence)
.x <- subset(adjective_cooccurrence, word != collocate)
n <- attr(adjective_cooccurrence, "corpus_size")
res <- coll_analysis(.x, o11, f1, f2, n, fun = "ll")
res[order(res$ll, decreasing = TRUE), ] |> head()
#>            word collocate       ll
#> 3953   economic    social 429.3335
#> 346    economic political 366.6125
#> 1583   catholic     roman 294.1170
#> 4240  political    social 286.6264
#> 34605    fiscal   uniform 277.0459
#> 2168      black     white 256.3603

# if column names match the argument names, the columns are used implicitly
c("o11", "f1", "f2") %in% names(.x) # TRUE
#> [1] TRUE TRUE TRUE
coll_analysis(.x, n = n, fun = "ll") |>
  head()
#>          word collocate        ll
#> 1       grand    recent  7.983294
#> 2   executive  over-all 13.500985
#> 3    possible  superior 15.837249
#> 4 hard-fought  superior 17.566465
#> 5 hard-fought  possible 13.312133
#> 6    relative      such 20.512751

# control names of output columns by using a named list
coll_analysis(.x, o11, f1, f2, n, fun = list(logl = "ll")) |>
  head()
#>          word collocate      logl
#> 1       grand    recent  7.983294
#> 2   executive  over-all 13.500985
#> 3    possible  superior 15.837249
#> 4 hard-fought  superior 17.566465
#> 5 hard-fought  possible 13.312133
#> 6    relative      such 20.512751

# using custom function
mi_base2 <- \(o11, e11) log2(o11 / e11)
coll_analysis(.x, o11, f1, f2, n, fun = mi_base2) |>
  head()
#>          word collocate  mi_base2
#> 1       grand    recent  7.171506
#> 2   executive  over-all 11.111001
#> 3    possible  superior  7.108428
#> 4 hard-fought  superior 13.655322
#> 5 hard-fought  possible 10.600281
#> 6    relative      such  6.306594

# mix built-in measures with custom functions
coll_analysis(.x, n = n, fun = list(builtin = "ll", custom = mi_base2)) |>
  head()
#>          word collocate   builtin    custom
#> 1       grand    recent  7.983294  7.171506
#> 2   executive  over-all 13.500985 11.111001
#> 3    possible  superior 15.837249  7.108428
#> 4 hard-fought  superior 17.566465 13.655322
#> 5 hard-fought  possible 13.312133 10.600281
#> 6    relative      such 20.512751  6.306594