Package 'hint'

Title: Tools for Hypothesis Testing Based on Hypergeometric Intersection Distributions
Description: Hypergeometric Intersection distributions are a broad group of distributions that describe the probability of picking intersections when drawing independently from two (or more) urns containing variable numbers of balls belonging to the same n categories. <arXiv:1305.0717>.
Authors: Alex T. Kalinka
Maintainer: Alex T. Kalinka <[email protected]>
License: GPL (>= 2)
Version: 0.1-3
Built: 2025-01-27 05:10:38 UTC
Source: https://github.com/alextkalinka/hint

Help Index


add.distr

Description

This function will add one or more distributions or hypothesis tests to an existing plot.

Usage

add.distr(..., cols = "blue", test.cols = "red")

Arguments

...

One or more distributions or objects of class hint.test.

cols

A character string vector naming the colours of the distributions. If length(cols) is less than the number of distributions, the colours will be recycled. Defaults to "blue".

test.cols

A character string vector naming the colours to use for the regions in which the cumulative probability of the hypothesis test was derived (if it exists). If length(test.cols) is less than the number of distributions, the colours will be recycled. Defaults to "red".

Value

Plots to the current device.


The Binomial Intersection Distribution

Description

Density, distribution function, quantile function and random generation for the binomial intersection distribution.

Usage

dbint(n, A, range = NULL, log = FALSE)

pbint(n, A, vals, upper.tail = TRUE, log.p = FALSE)

qbint(p, n, A, upper.tail = TRUE, log.p = FALSE)

rbint(num = 5, n, A)

Arguments

n

An integer specifying the number of categories in the urns.

A

A vector of integers specifying the numbers of balls drawn from each urn. The length of the vector equals the number of urns.

range

A vector of integers specifying the intersection sizes for which probabilities (dhint) or cumulative probabilites (phint) should be computed (can be a single number). If range is NULL (default) then probabilities will be returned over the entire range of possible values.

log

Logical. If TRUE, probabilities p are given as log(p). Defaults to FALSE.

vals

A vector of integers specifying the intersection sizes for which probabilities (dhint) or cumulative probabilites (phint) should be computed (can be a single number). If range is NULL (default) then probabilities will be returned over the entire range of possible values.

upper.tail

Logical. If TRUE, probabilities are P(X >= v), else P(X <= v). Defaults to TRUE.

log.p

Logical. If TRUE, probabilities p are given as log(p). Defaults to FALSE.

p

A probability between 0 and 1.

num

An integer specifying the number of random numbers to generate. Defaults to 5.

Details

The binomial intersection distribution is given by

P(X=vN)=(bv)(i=1N1pi)v(1i=1N1pi)bvP(X = v|N) = {b \choose v} \left(\prod_{i=1}^{N-1} p_{i}\right)^{v} \left(1 - \prod_{i=1}^{N-1} p_{i}\right)^{b-v}

where b gives the sample size which is smallest. This is an approximation for the hypergeometric intersection distribution when nn is large and bb is small relative to the samples taken from the N1N-1 other urns.

Examples

## Generate the distribution of intersections sizes:
dd <- dbint(20, c(10, 12, 11, 14))
## Restrict the range of intersections.
dd <- dbint(20, c(10, 12), range = 0:5)
## Generate cumulative probabilities.
pp <- pbint(29, c(15, 8), vals = 5)
pp <- pbint(29, c(15, 8), vals = 2, upper.tail = FALSE)
## Extract quantiles:
qq <- qbint(0.15, 23, c(12, 10))
## Generate random samples from Binomial intersection distributions.
rr <- rbint(num = 10, 18, c(9, 14))

hint.dist.test

Description

Tests whether the absolute distance between two intersection sizes would be expected by chance, i.e. whether they fall into opposite tails of their respective Hypergeometric Intersection distributions.

Usage

hint.dist.test(d, n1, A1, n2, A2, q1 = 0, q2 = 0, alternative = "greater")

Arguments

d

A positive integer specifying the observed distance to be tested.

n1

An integer specifying the number of categories in the urns for the first distribution.

A1

An integer vector specifying the number of balls drawn from urns for the first distribution.

n2

An integer specifying the number of categories in the urns for the second distribution.

A2

An integer vector specifying the number of balls drawn from the urns for the second distribution.

q1

An integer specifying the number of categories with duplicates in the second urn of the first distribution. If 0 then the symmetric, singleton case is computed, otherwise the asymmetric, duplicates case is computed (see Hyperintersection).

q2

An integer specifying the number of categories with duplicates in the second urn of the second distribution. If 0 then the symmetric, singleton case is computed, otherwise the asymmetric, duplicates case is computed (see Hyperintersection).

alternative

A characer string specifying the hypothesis to be tested. Can be one of "greater", "less", or "two.sided".

Details

The distribution of absolute distances between two hypergeometric intersection sizes is given by

P(X=d)={v1,v2}iDdDdP(v1in1,a1,b1,...)P(v2in2,a2,b2,...)P(X=d) = \sum_{\{v_{1},v_{2}\}_{i} \in D_{d}}^{|D_{d}|} P(v_{1_i}|n_{1},a_{1},b_{1},...)\cdot P(v_{2_i}|n_{2},a_{2},b_{2},...)

where DdD_{d} is the set of pairs of intersection sizes, {v1,v2}\{v_{1},v_{2}\}, with absolute differences of size dd.

Value

An object of class hint.dist.test, which is a list containing the following components:

  • parameters An integer vector giving the parameter values.

  • p.value A numerical value giving the p-value associated with the test.

  • alternative A character string naming the hypothesis that was tested.


hint.test

Description

Apply the hypergeometric intersection test to categorical data to test for enrichment or depletion of intersections between two samples.

Usage

hint.test(cats, draw1, draw2, alternative = "greater")

Arguments

cats

A data frame or matrix with 3 columns; the first gives the category identifier, and the second and third give the number of balls belonging to this category in the first and second urns respectively.

draw1

A vector of objects corresponding to the categories given in cats drawn from the first urn.

draw2

A vector of objects corresponding to the categories given in cats drawn from the second urn.

alternative

A characer string specifying the hypothesis to be tested. Can be one of "greater", "less", or "two.sided".

Details

The hypergeometric intersection distributions describe the distribution of intersection sizes when sampling without replacement from two separate urns in which reside balls belonging to the same n object categories (see Hyperintersection).

Value

An object of class hint.test, which is a list containing the following components:

  • parameters An integer vector giving the parameter values.

  • p.value A numerical value giving the p-value associated with the test.

  • alternative A character string naming the hypothesis that was tested.

References

Kalinka, A. T. (2013). The probability of drawing intersections: extending the hypergeometric distribution. arXiv.1305.0717


Drawing Distinct Categories from a Single Urn

Description

Density, distribution function, quantile function and random generation for the distribution of distinct categories drawn from a single urn in which there are duplicates in q of the categories.

Usage

dhydist(n, a, q, range = NULL, log = FALSE)

phydist(n, a, q, vals, upper.tail = TRUE, log.p = FALSE)

qhydist(p, n, a, q, upper.tail = TRUE, log.p = FALSE)

rhydist(num = 5, n, a, q)

Arguments

n

An integer specifying the number of categories in the urn.

a

An integer specifying the number of balls drawn from the urn.

q

An integer specifying the number of categories in the urn which have duplicate members.

range

A vector of integers specifying the intersection sizes for which probabilities (dhydist) or cumulative probabilites (phydist) should be computed (can be a single number). If range is NULL (default) then probabilities will be returned over the entire range of possible values.

log

Logical. If TRUE, probabilities p are given as log(p). Defaults to FALSE.

vals

A vector of integers specifying the intersection sizes for which probabilities (dhydist) or cumulative probabilites (phydist) should be computed (can be a single number). If range is NULL (default) then probabilities will be returned over the entire range of possible values.

upper.tail

Logical. If TRUE, probabilities are P(X >= c), else P(X <= c). Defaults to TRUE.

log.p

Logical. If TRUE, probabilities p are given as log(p). Defaults to FALSE.

p

A probability between 0 and 1.

num

An integer specifying the number of random numbers to generate. Defaults to 5.

Examples

## Generate the distribution of distinct categories drawn from a single urn.
dd <- dhydist(20, 10, 12)
## Restrict the range of intersections.
dd <- dhydist(20, 10, 12, range = 5:10)
## Generate cumulative probabilities.
pp <- phydist(29, 15, 8, vals = 5)
pp <- phydist(29, 15, 8, vals = 2, upper.tail = FALSE)
## Extract quantiles:
qq <- qhydist(0.15, 23, 12, 10)
## Generate random samples based on this distribution.
rr <- rhydist(num = 10, 18, 9, 12)

The Hypergeometric Intersection Family of Distributions

Description

The Hypergeometric Intersection Family of Distributions

Usage

dhint(n, A, q = 0, range = NULL, approx = FALSE, log = FALSE, verbose = TRUE)

phint(n, A, q = 0, vals, upper.tail = TRUE, log.p = FALSE)

qhint(p, n, A, q = 0, upper.tail = TRUE, log.p = FALSE)

rhint(num = 5, n, A, q = 0)

Arguments

n

An integer specifying the number of categories in the urns.

A

A vector of integers specifying the numbers of balls drawn from each urn. The length of the vector equals the number of urns.

q

An integer specifying the number of categories in the second urn which have duplicate members. If q is 0 (default) then the symmetrical, singleton case is computed, otherwise the asymmetrical, duplicates case is computed (see Details).

range

A vector of integers specifying the intersection sizes for which probabilities (dhint) or cumulative probabilites (phint) should be computed (can be a single number). If range is NULL (default) then probabilities will be returned over the entire range of possible values.

approx

Logical. If TRUE, a binomial approximation will be used to generate the distribution.

log

Logical. If TRUE, probabilities p are given as log(p). Defaults to FALSE.

verbose

Logical. If TRUE, progress of calculation in the asymmetric, duplicates case is printed to the screen.

vals

A vector of integers specifying the intersection sizes for which probabilities (dhint) or cumulative probabilites (phint) should be computed (can be a single number). If range is NULL (default) then probabilities will be returned over the entire range of possible values.

upper.tail

Logical. If TRUE, probabilities are P(X >= c), else P(X <= c). Defaults to TRUE.

log.p

Logical. If TRUE, probabilities p are given as log(p). Defaults to FALSE.

p

A probability between 0 and 1.

num

An integer specifying the number of random numbers to generate. Defaults to 5.

Details

The hypergeometric intersection distributions describe the distribution of intersection sizes when sampling without replacement from two separate urns in which reside balls belonging to the same n object categories. In the simplest case when there is exactly one ball in each category in each urn (symmetrical, singleton case), then the distribution is hypergeometric:

P(X=v)=(av)(nabv)(nb)P(X=v)=\frac{{a \choose v}{n-a \choose b-v}}{{n \choose b}}

When there are three urns, the distribution is given by

P(X=v)=(av)i(avi)(nabvi)(nvicv)(nb)(nc)P(X=v) = \frac{ {a \choose v} \sum_{i} {a-v \choose i} {n-a \choose b-v-i} {n-v-i \choose c-v} }{ {n \choose b} {n \choose c} }

If, however, we allow duplicates in qnq \leq n of the categories in the second urn, then the distribution of intersection sizes is described by the following variant of the hypergeometric:

P(X=v)=m=0αl=0βj=0l(nqvl)(ql)(qlm)(nvq+lavm)(lj)(n+qamjbv)/(na)(n+qb)P(X=v) = \sum_{m=0}^{\alpha} \sum_{l=0}^{\beta} \sum_{j=0}^{l} {n-q \choose v-l} {q \choose l} {q-l \choose m} {n-v-q+l \choose a-v-m} {l \choose j} {n+q-a-m-j \choose b-v} / {n \choose a}{n+q \choose b}

Value

'dhint', 'phint', and 'qhint' return a data frame with two columns: v, the intersection size, and p, the associated p-values. 'rhint' returns an integer vector of random samples based on the hypergeometric intersection distribution.

References

Kalinka, A. T. (2013). The probability of drawing intersections: extending the hypergeometric distribution. arXiv.1305.0717

Examples

## Generate the distribution of intersections sizes without duplicates:
dd <- dhint(20, c(10, 12))
## Restrict the range of intersections.
dd <- dhint(20, c(10, 12), range = 0:5)
## Allow duplicates in q of the categories in the second urn:
dd <- dhint(35, c(15, 11), 22, verbose = FALSE)
## Generate cumulative probabilities.
pp <- phint(29, c(15, 8), vals = 5)
pp <- phint(29, c(15, 8), vals = 2, upper.tail = FALSE)
pp <- phint(29, c(15, 8), 23, vals = 2)
## Extract quantiles:
qq <- qhint(0.15, 23, c(12, 10))
qq <- qhint(0.15, 23, c(12, 10), 18)
## Generate random samples from Hypergeometric intersection distributions.
rr <- rhint(num = 10, 18, c(9, 14))
rr <- rhint(num = 10, 22, c(11, 17), 12)

plot.hint.test

Description

This function visualises the results of a Hypergeometric Intersection test.

Usage

## S3 method for class 'hint.test'
plot(x, ...)

Arguments

x

An object of class 'hint.test'.

...

Additional arguments to be passed to 'plot'.

Details

Plots the relevant Hypergeometric Intersection distribution as a segment plot, and highlights the region where the observed statistic falls, i.e. the region from which the probability is computed (two.sided tests are visualised in one tail, the one with the smallest density). This can be especially useful for pedagogical purposes.

Value

Plots to the current device.


plotDistr

Description

Plot a distribution or visualise the result of a hypothesis test.

Usage

plotDistr(
  distr,
  col = "black",
  test.col = "red",
  xlim = NULL,
  ylim = NULL,
  xlab = "Intersection size (v)",
  ylab = "Probability",
  add = FALSE,
  ...
)

Arguments

distr

A data frame or matrix in which the first column gives random variable values, and the second gives probabilities. Can also be a vector (in which case random variables of 0:length(distr) will be automatically assigned, or an object of class hint.test.

col

A character string naming the colour to use for the distribution. Defaults to "black".

test.col

A character string naming the colour to use for the region in which the cumulative probability of the hypothesis test was derived (if it exists). Defaults to "red".

xlim

A vector of two numbers giving the range for the x-axis. If NULL (default), then this is determined by the maximum and minimum values in distr.

ylim

A vector of two numbers giving the range for the y-axis. If NULL (default), then this is determined by the maximum and minimum values in distr.

xlab

A character string giving a label for the x-axis. Deafults to "Intersection size (v)".

ylab

A character string giving a label for the y-axis. Deafults to "Probability".

add

Logical. Whether the plot will be added to an existing plot or not. Defaults to FALSE.

...

Additional arguments to be passed to plot.

Details

Visualising the results of a hypothesis test may often be of interest, but can be especially useful for pedagogical purposes.

Value

Plots to the current device.


print.hint.test

Description

Prints the resuls of 'hint.test'.

Usage

## S3 method for class 'hint.test'
print(x, ...)

Arguments

x

An object of class 'hint.test'.

...

Additional arguments to be passed to 'print'.

Value

Prints output to the console.