Association measures for collocation and collostruction analyses
coll_analysis.Rd
Calculates common association measures used to perform collocation or collostruction analysis for typical count data.
Usage
coll_analysis(.x, ...)
# S3 method for data.frame
coll_analysis(
  .x,
  o11 = NULL,
  f1 = NULL,
  f2 = NULL,
  n = NULL,
  fun = "ll",
  flip = NULL,
  ...
)
# S3 method for matrix
coll_analysis(.x, f2 = NULL, n = NULL, fun = "ll", flip = NULL, ...)
# S3 method for default
coll_analysis(.x, o11, f1, f2 = NULL, n = NULL, fun = "ll", flip = NULL, ...)
Arguments
- .x
data.frame or list containing data
- ...
further arguments to be passed to or from other methods
- o11
numeric: joint frequencies
- f1
numeric: corpus frequencies of the word
- f2
numeric of length 1 or of the same length as o11: corpus frequencies of the co-occurring structure; if omitted, the sum of o11 is used
- n
numeric of length 1 or of the same length as o11: corpus or sample size; if omitted, sum(f1 + f2) is used; this may be undesirable for collostruction analysis, where the corpus size should always be passed explicitly
- fun
character vector or named list containing character, function or expression elements: the association measures to calculate; character elements name built-in measures (see Details)
- flip
character: names of measures for which to flip the sign for cases with negative association, intended for two-sided measures
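The effect described for flip can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper g2() is hypothetical and computes a standard log-likelihood ratio statistic for a 2x2 table, the counts are invented, and "negative association" is assumed to mean that the observed joint frequency falls below its expected value.

```r
# Hypothetical helper (not part of the package): log-likelihood ratio (G2)
# for 2x2 tables defined by joint frequency, marginals and corpus size
g2 <- function(o11, f1, f2, n) {
  obs <- cbind(o11, f1 - o11, f2 - o11, n - f1 - f2 + o11)
  exp <- cbind(f1 * f2, f1 * (n - f2), (n - f1) * f2, (n - f1) * (n - f2)) / n
  # cells with zero observations contribute 0 to the sum
  terms <- ifelse(obs > 0, obs * log(obs / exp), 0)
  2 * rowSums(terms)
}

# Invented counts: the first pair is attracted, the second is repelled
o11 <- c(30, 1)
f1  <- c(200, 200)
f2  <- c(500, 5000)
n   <- 100000

e11 <- f1 * f2 / n            # expected joint frequencies
ll  <- g2(o11, f1, f2, n)     # two-sided measure, never negative

# Flip the sign where the observed frequency is below the expected one
ll_flipped <- ifelse(o11 < e11, -ll, ll)
```

With the sign flipped, sorting by the measure separates attracted pairs (positive values) from repelled pairs (negative values) instead of mixing them at the top of the ranking.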
Value
an object similar to .x with one result per column for the association measures specified in fun; row names in matrices and character or factor columns in data.frames are preserved
Details
For collocation analysis, f1 and f2 typically represent the corpus
frequencies of the word and the collocate, respectively, i.e. they include
the co-occurrences counted in o11. For collostruction analysis, f1 represents
the corpus frequency of the word, and f2 the construction frequency. In a
contingency table, f1 and f2 are the marginal sums.
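How o11, f1, f2 and n relate to a 2x2 contingency table can be sketched in base R. The counts below are invented for illustration and the variable names observed and expected are assumptions of this sketch; it does not show how the package computes its measures internally.

```r
# Invented counts for a single word-collocate pair
o11 <- 30      # joint frequency
f1  <- 200     # corpus frequency of the word (row marginal)
f2  <- 500     # corpus frequency of the collocate (column marginal)
n   <- 100000  # corpus size

# Observed 2x2 table derived from the joint frequency and the marginals
observed <- matrix(
  c(o11,      f1 - o11,
    f2 - o11, n - f1 - f2 + o11),
  nrow = 2, byrow = TRUE,
  dimnames = list(word = c("yes", "no"), collocate = c("yes", "no"))
)

# Expected frequencies under independence: product of marginals over n
expected <- outer(rowSums(observed), colSums(observed)) / n

e11 <- expected[1, 1]  # same value as f1 * f2 / n
```

Because f1 and f2 already include the co-occurrences, the remaining three cells follow by subtraction, which is why only four numbers are needed per row.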
Both the construction frequency f2 and the corpus size n can be provided
as vectors, which allows for efficient calculations over data from multiple
constructions/corpora.
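The benefit of vector-valued f2 and n can be illustrated with plain element-wise arithmetic. This is a sketch with invented counts for rows drawn from two hypothetical constructions; it mirrors the idea only and makes no claim about the package internals.

```r
# Invented counts: rows 1-2 belong to one construction, rows 3-4 to another
o11 <- c(12, 3, 40, 7)
f1  <- c(150, 90, 150, 90)

# One construction frequency and one corpus size per row
f2 <- c(800, 800, 2500, 2500)
n  <- c(1e6, 1e6, 5e6, 5e6)

# Expected joint frequencies, computed element-wise in one pass
e11 <- f1 * f2 / n

# A simple association score per row (observed over expected, base 2)
mi <- log2(o11 / e11)

# Length-1 values are recycled, which covers the single-construction case
e11_single <- f1 * 800 / 1e6
```

Each row is paired with its own construction frequency and corpus size, so data from several constructions or corpora can be stacked into one table and scored in a single call rather than in a loop.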
For data.frame input, the values for "o11", "f1", "f2" and "n" can be provided either explicitly, as expressions or character strings, or implicitly by column name. It is recommended to pass the columns explicitly.
Matrix input currently requires the column names "o11", "f1", "f2" and "n".
Examples
data(adjective_cooccurrence)
.x <- subset(adjective_cooccurrence, word != collocate)
n <- attr(adjective_cooccurrence, "corpus_size")
res <- coll_analysis(.x, o11, f1, f2, n, fun = "ll")
res[order(res$ll, decreasing = TRUE), ] |> head()
#> word collocate ll
#> 3953 economic social 429.3335
#> 346 economic political 366.6125
#> 1583 catholic roman 294.1170
#> 4240 political social 286.6264
#> 34605 fiscal uniform 277.0459
#> 2168 black white 256.3603
# if column names match the argument names, the arguments can be omitted and are matched implicitly
c("o11", "f1", "f2") %in% names(.x) # TRUE
#> [1] TRUE TRUE TRUE
coll_analysis(.x, n = n, fun = "ll") |>
head()
#> word collocate ll
#> 1 grand recent 7.983294
#> 2 executive over-all 13.500985
#> 3 possible superior 15.837249
#> 4 hard-fought superior 17.566465
#> 5 hard-fought possible 13.312133
#> 6 relative such 20.512751
# control names of output columns by using a named list
coll_analysis(.x, o11, f1, f2, n, fun = list(logl = "ll")) |>
head()
#> word collocate logl
#> 1 grand recent 7.983294
#> 2 executive over-all 13.500985
#> 3 possible superior 15.837249
#> 4 hard-fought superior 17.566465
#> 5 hard-fought possible 13.312133
#> 6 relative such 20.512751
# using custom function
mi_base2 <- \(o11, e11) log2(o11 / e11)
coll_analysis(.x, o11, f1, f2, n, fun = mi_base2) |>
head()
#> word collocate mi_base2
#> 1 grand recent 7.171506
#> 2 executive over-all 11.111001
#> 3 possible superior 7.108428
#> 4 hard-fought superior 13.655322
#> 5 hard-fought possible 10.600281
#> 6 relative such 6.306594
# mix built-in measures with custom functions
coll_analysis(.x, n = n, fun = list(builtin = "ll", custom = mi_base2)) |>
head()
#> word collocate builtin custom
#> 1 grand recent 7.983294 7.171506
#> 2 executive over-all 13.500985 11.111001
#> 3 possible superior 15.837249 7.108428
#> 4 hard-fought superior 17.566465 13.655322
#> 5 hard-fought possible 13.312133 10.600281
#> 6 relative such 20.512751 6.306594