Title: | Multiplicity of Infection and Allele Frequency Recovery from Noisy Polyallelic Genetics Data |
---|---|
Description: | A Markov Chain Monte Carlo (MCMC) based approach to Bayesian estimation of individual-level multiplicity of infection, within-host relatedness, and population allele frequencies from polyallelic genetic data. |
Authors: | Maxwell Murphy [aut, cre], Bryan Greenhouse [aut, ths] |
Maintainer: | Maxwell Murphy <[email protected]> |
License: | GPL (>= 3) |
Version: | 3.5.0 |
Built: | 2025-01-05 06:20:55 UTC |
Source: | https://github.com/eppicenter/moire |
Calculate the expected heterozygosity from allele frequencies
calculate_he(allele_freqs)
allele_freqs |
Simplex of allele frequencies |
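As a minimal sketch of the quantity involved, assuming the standard definition of expected heterozygosity (one minus the sum of squared allele frequencies), the computation can be written in base R. The helper name expected_he is hypothetical and is not part of the package; whether calculate_he() applies any further correction is not stated here.

```r
# Hypothetical helper, not the package implementation.
# Expected heterozygosity at one locus: 1 minus the probability that two
# independent draws from the allele frequency simplex are the same allele.
expected_he <- function(allele_freqs) {
  stopifnot(all(allele_freqs >= 0), abs(sum(allele_freqs) - 1) < 1e-8)
  1 - sum(allele_freqs^2)
}

he_even <- expected_he(c(0.25, 0.25, 0.25, 0.25))  # four equally common alleles
he_skew <- expected_he(c(0.9, 0.1))                # one dominant allele
```

A locus with evenly distributed alleles gives a higher value than one dominated by a single allele.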
Calculate the geometric median of the posterior distribution of allele frequencies
calculate_med_allele_freqs(mcmc_results, merge_chains = TRUE)
mcmc_results |
Result of calling run_mcmc() |
merge_chains |
boolean indicating that all chain results should be merged |
Returns the geometric median of the posterior distribution, defined as the point minimizing the sum of L2 (Euclidean) distances to the sampled points.
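The geometric median has no closed form, so it is usually found iteratively. The following base R sketch uses Weiszfeld's algorithm, which is an assumption made for illustration and may differ from what the package does internally. The function name geometric_median and the toy samples matrix are hypothetical.

```r
# Hypothetical sketch using Weiszfeld's algorithm; not the package implementation.
# points: matrix with one sampled allele frequency vector per row.
geometric_median <- function(points, max_iter = 1000, tol = 1e-10) {
  est <- colMeans(points)  # start from the arithmetic mean
  for (i in seq_len(max_iter)) {
    # Euclidean distance from the current estimate to every sampled point
    dists <- sqrt(rowSums(sweep(points, 2, est)^2))
    dists <- pmax(dists, 1e-12)  # guard against division by zero
    weights <- 1 / dists
    # Re-estimate as the inverse-distance weighted average of the points
    new_est <- colSums(points * weights) / sum(weights)
    converged <- sqrt(sum((new_est - est)^2)) < tol
    est <- new_est
    if (converged) break
  }
  est
}

# Three toy posterior draws of a three-allele frequency vector
samples <- rbind(
  c(0.5, 0.3, 0.2),
  c(0.6, 0.2, 0.2),
  c(0.4, 0.4, 0.2)
)
med <- geometric_median(samples)
```

Because each update is a weighted average of points on the simplex, the estimate itself stays on the simplex.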
Calculate naive allele frequencies
calculate_naive_allele_frequencies(data)
data |
List of lists of numeric vectors, where each list element is a collection of observations across samples at a single genetic locus |
Estimate naive allele frequencies from the empirical distribution of alleles
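A rough base R sketch of a naive estimate follows. It assumes that each numeric vector is a 0/1 indicator of which alleles were observed in one sample at one locus, and that the naive frequency is the share of observations belonging to each allele. Both points are assumptions, handling of missing data is not shown, and the name naive_allele_freqs is hypothetical.

```r
# Hypothetical sketch; assumes 0/1 presence vectors per sample at each locus.
naive_allele_freqs <- function(data) {
  lapply(data, function(locus) {
    counts <- Reduce(`+`, locus)  # times each allele was observed across samples
    counts / sum(counts)          # normalize to a simplex
  })
}

# Toy input: two loci, three samples each
toy <- list(
  locus_a = list(c(1, 0, 0), c(1, 1, 0), c(0, 0, 1)),
  locus_b = list(c(1, 0), c(1, 1), c(1, 0))
)
freqs <- naive_allele_freqs(toy)
```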
Calculate naive COI
calculate_naive_coi(data)
data |
List of lists of numeric vectors, where each list element is a collection of observations across samples at a single genetic locus. |
Estimates the complexity of infection using a naive approach that chooses the highest number of observed alleles.
Calculate naive COI offset
calculate_naive_coi_offset(data, offset)
data |
List of lists of numeric vectors, where each list element is a collection of observations across samples at a single genetic locus. |
offset |
Numeric offset – n'th highest number of observed alleles |
Estimates the complexity of infection using a naive approach that chooses the n'th highest number of observed alleles.
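The naive estimators can be illustrated with a short base R sketch. It assumes the same input shape as above (one list per locus, one 0/1 presence vector per sample) and takes, for each sample, the n'th highest number of alleles observed at any locus. The function name naive_coi_offset is hypothetical, and an offset of 1 corresponds to the plain naive COI.

```r
# Hypothetical sketch; assumes 0/1 presence vectors per sample at each locus.
naive_coi_offset <- function(data, offset = 1) {
  n_samples <- length(data[[1]])
  vapply(seq_len(n_samples), function(s) {
    # Number of alleles observed for sample s at every locus
    per_locus <- vapply(data, function(locus) sum(locus[[s]]), numeric(1))
    ranked <- sort(unname(per_locus), decreasing = TRUE)
    ranked[min(offset, length(ranked))]  # n'th highest count
  }, numeric(1))
}

# Toy input: three loci, two samples
toy <- list(
  list(c(1, 1, 0), c(1, 0, 0)),
  list(c(1, 1, 1), c(0, 1, 0)),
  list(c(1, 0), c(1, 1))
)
coi_max <- naive_coi_offset(toy, offset = 1)     # highest count per sample
coi_second <- naive_coi_offset(toy, offset = 2)  # second highest count per sample
```

Using the second highest count rather than the maximum makes the estimate less sensitive to a single locus with spurious alleles.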
Load delimited data
load_delimited_data(data, sep = ";", warn_uninformative = TRUE)
data |
data.frame containing the described data |
sep |
string used to separate alleles |
warn_uninformative |
boolean whether or not to print message when removing uninformative loci |
Load a data.frame with a sample_id column, in which the remaining columns are loci. Each cell contains a separator-delimited string representing the observed alleles at that locus for that sample. The returned data contains the vectors sample_ids and loci, ordered as the results from running the MCMC algorithm will be ordered.
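To make the expected layout concrete, here is a small base R sketch of such a data frame and of how a cell splits into alleles. The sample, locus, and allele names are invented for illustration, and the splitting shown is ordinary strsplit(), not the package's internal parsing.

```r
# Toy wide-format input: one row per sample, one column per locus
wide <- data.frame(
  sample_id = c("s1", "s2"),
  locus_1 = c("a;b", "a"),
  locus_2 = c("x", "x;y;z"),
  stringsAsFactors = FALSE
)

# Alleles observed for sample s2 at locus_2
alleles_s2_l2 <- strsplit(wide$locus_2[2], ";", fixed = TRUE)[[1]]

# Number of alleles per sample (rows) and locus (columns)
n_alleles <- sapply(wide[-1], function(col) lengths(strsplit(col, ";", fixed = TRUE)))
```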
Load long form data
load_long_form_data(df, warn_uninformative = TRUE)
df |
data frame with 3 columns: sample_id, locus, allele |
warn_uninformative |
boolean whether or not to print message when removing uninformative loci |
Long form data is a data frame with 3 columns: sample_id, locus, allele. The returned data contains the vectors sample_ids and loci, ordered as the results from running the MCMC algorithm will be ordered.
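A small base R sketch shows what long form input looks like and how it relates to the delimited layout, with one row per observed allele. The identifiers are invented for illustration, and the tapply() call is only a way to inspect the data, not part of the package.

```r
# Toy long-form input: one row per (sample, locus, allele) observation
long <- data.frame(
  sample_id = c("s1", "s1", "s1", "s2", "s2"),
  locus = c("locus_1", "locus_1", "locus_2", "locus_1", "locus_2"),
  allele = c("a", "b", "x", "a", "y"),
  stringsAsFactors = FALSE
)

# Collapse to a samples x loci matrix of ";"-delimited allele strings
wide_mat <- tapply(long$allele, list(long$sample_id, long$locus),
                   paste, collapse = ";")
```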
MCMC results from using the packaged simulated data and calling run_mcmc()
mcmc_results
An object of class list of length 3.
A dataset containing the genetic and epidemiological data from Namibia
namibia_data
A data frame with 7 columns and 97214 rows:
Sample ID
Health facility
Health district
Region
Country
Genetic locus
Allele observed
https://doi.org/10.7554/eLife.43510.018
Plot chain swap acceptance rates
plot_chain_swaps(mcmc_results)
mcmc_results |
list of results from |
Plot the swap acceptance rates for each chain. The x-axis is the temperature, and the y-axis is the swap acceptance rate. The dashed lines indicate the temperatures used for parallel tempering.
list of ggplot objects
Dirichlet distribution
rdirichlet(n, alpha)
n |
total number of draws |
alpha |
vector controlling the concentration of the simplex |
Implementation of random sampling from a Dirichlet distribution
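One common way to sample from a Dirichlet distribution is to normalize independent gamma draws. The base R sketch below uses that construction as an assumption. It is not necessarily how rdirichlet() is implemented, and the name rdirichlet_sketch and the one-row-per-draw return shape are hypothetical.

```r
# Hypothetical sketch via normalized gamma draws; not the package implementation.
rdirichlet_sketch <- function(n, alpha) {
  k <- length(alpha)
  # Column j holds n draws from Gamma(shape = alpha[j], rate = 1)
  draws <- matrix(rgamma(n * k, shape = rep(alpha, each = n), rate = 1),
                  nrow = n, ncol = k)
  draws / rowSums(draws)  # normalize each row onto the simplex
}

set.seed(1)
freqs <- rdirichlet_sketch(5, alpha = c(1, 1, 1))
```

Each row of the result is one draw. Larger values of alpha concentrate draws near the center of the simplex, while values below 1 push mass toward its corners.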
A list of allele frequencies for different regions, estimated from the pf7k dataset.
regional_allele_frequencies
A list of lists, where each list element is a list of allele frequencies for a specific region.
Sample from the target distribution using MCMC
run_mcmc( data, is_missing = FALSE, allow_relatedness = TRUE, thin = 1, burnin = 10000, samples_per_chain = 1000, verbose = TRUE, use_message = FALSE, eps_pos_alpha = 1, eps_pos_beta = 1, eps_neg_alpha = 1, eps_neg_beta = 1, r_alpha = 1, r_beta = 1, mean_coi_shape = 0.1, mean_coi_scale = 10, max_eps_pos = 2, max_eps_neg = 2, max_coi = 40, record_latent_genotypes = FALSE, num_chains = 1, num_cores = 1, pt_chains = 1, pt_grad = 1, pt_num_threads = 1, adapt_temp = TRUE, pre_adapt_steps = 25, temp_adapt_steps = 25, max_initialization_tries = 10000, max_runtime = Inf )
data |
Data to be used in MCMC, as generated by the |
is_missing |
Boolean matrix indicating whether the observation should be treated as missing data and ignored. Number of rows equals the number of loci, number of columns equals the number of samples. Alternatively, the user may pass in FALSE if no data should be considered missing. |
allow_relatedness |
Bool indicating whether or not to allow relatedness within host |
thin |
Positive Integer. How often to sample from mcmc, 1 means do not thin |
burnin |
Positive Integer. Number of MCMC samples to discard as burnin |
samples_per_chain |
Positive Integer. Number of samples to take after burnin |
verbose |
Logical indicating if progress is printed |
use_message |
Logical indicating if progress is printed using message or print |
eps_pos_alpha |
Positive Numeric. Alpha parameter in Beta distribution for eps_pos prior |
eps_pos_beta |
Positive Numeric. Beta parameter in Beta distribution for eps_pos prior |
eps_neg_alpha |
Positive Numeric. Alpha parameter in Beta distribution for eps_neg prior |
eps_neg_beta |
Positive Numeric. Beta parameter in Beta distribution for eps_neg prior |
r_alpha |
Positive Numeric. Alpha parameter in Beta distribution for relatedness prior |
r_beta |
Positive Numeric. Beta parameter in Beta distribution for relatedness prior |
mean_coi_shape |
shape parameter for gamma hyperprior on mean COI |
mean_coi_scale |
scale parameter for gamma hyperprior on mean COI |
max_eps_pos |
Numeric. Maximum allowed value for eps_pos |
max_eps_neg |
Numeric. Maximum allowed value for eps_neg |
max_coi |
Positive Numeric. Maximum allowed complexity of infection |
record_latent_genotypes |
Logical indicating whether or not to record the latent genotypes at each step of the MCMC. WARNING: This will increase the size of the output object significantly. |
num_chains |
Total number of chains to run, possibly simultaneously |
num_cores |
Total OMP parallel threads to use to run chains. num_cores * pt_num_threads should not exceed the number of cores available on your system. |
pt_chains |
Total number of chains to run with parallel tempering or a vector containing the temperatures that should be used for parallel tempering. |
pt_grad |
Power to raise parallel tempering chains to. A value of 1 results in evenly distributed temperatures between [0,1], below 1 will bias towards 1 and above 1 will bias towards 0. Only used if pt_chains is a single value (i.e. not a vector). |
pt_num_threads |
Total number of OMP parallel threads to be used to process parallel tempered chains. num_cores * pt_num_threads should not exceed the number of cores available on your system. |
adapt_temp |
Logical indicating whether or not to adapt the parallel tempering temperatures. If TRUE, the temperatures will be adapted during the |
pre_adapt_steps |
Number of steps to take before starting to adapt the parallel tempering temperatures. Only used if |
temp_adapt_steps |
Number of steps to take between temperature adaptation steps. Only used if |
max_initialization_tries |
Number of times to try to initialize the chain before giving up |
max_runtime |
Maximum runtime in minutes. If the MCMC is running for more than this amount of time, the function will stop and return the current state of the MCMC. |
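The effect of pt_grad on the initial temperature ladder can be sketched in base R. The sketch assumes the ladder is an evenly spaced grid on [0, 1], including both endpoints, raised to the power pt_grad. The exact grid and endpoints used by run_mcmc() are not stated here, and the name temperature_ladder is hypothetical.

```r
# Hypothetical sketch of a power-law temperature ladder; the grid endpoints
# are an assumption, not taken from the package.
temperature_ladder <- function(pt_chains, pt_grad = 1) {
  seq(0, 1, length.out = pt_chains)^pt_grad
}

even_ladder <- temperature_ladder(5, pt_grad = 1)    # evenly spaced
cold_ladder <- temperature_ladder(5, pt_grad = 2)    # biased towards 0
warm_ladder <- temperature_ladder(5, pt_grad = 0.5)  # biased towards 1
```

With pt_grad = 2 the middle temperature drops from 0.5 to 0.25, and with pt_grad = 0.5 it rises to about 0.71, matching the direction of bias described for the argument.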
Simulate allele frequencies
simulate_allele_frequencies(alpha, num_loci)
alpha |
vector parameter controlling the Dirichlet distribution |
num_loci |
total number of loci to draw |
Simulate allele frequency vectors as a draw from a Dirichlet distribution
Simulate data generated according to the assumed model
simulate_data( mean_coi = NULL, num_samples, epsilon_pos, epsilon_neg, sample_cois = NULL, locus_freq_alphas = NULL, allele_freqs = NULL, internal_relatedness_alpha = 0, internal_relatedness_beta = 1, internal_relatedness = NULL, missingness = 0 )
mean_coi |
Mean multiplicity of infection drawn from a Poisson |
num_samples |
Total number of biological samples to simulate |
epsilon_pos |
False positive rate, expected number of false positives |
epsilon_neg |
False negative rate, expected number of false negatives |
sample_cois |
List of sample COIs to be used instead of simulating |
locus_freq_alphas |
List of alpha vectors to be used to simulate from a Dirichlet distribution to generate allele frequencies. |
allele_freqs |
List of allele frequencies to be used instead of simulating allele frequencies |
internal_relatedness_alpha |
alpha parameter of beta distribution controlling the random relatedness draws for each sample |
internal_relatedness_beta |
beta parameter of beta distribution controlling the random relatedness draws for each sample |
internal_relatedness |
List of internal relatedness values to be used instead of simulating |
missingness |
probability of data being missing |
Simulated data that is structured to go into the MCMC sampler
Simulates the observation process
simulate_observed_allele(alleles, epsilon_pos, epsilon_neg, missingness)
alleles |
A numeric vector representing the number of strains contributing each allele |
epsilon_pos |
expected number of false positives |
epsilon_neg |
expected number of false negatives |
missingness |
probability that the data is missing |
Takes a numeric vector representing the number of strains contributing each allele and returns a binary vector indicating the presence or absence of each allele.
Simulate observed genotypes
simulate_observed_genotype( true_genotypes, epsilon_pos, epsilon_neg, missingness )
true_genotypes |
a list of numeric vectors that are input to simulate_observed_allele |
epsilon_pos |
expected number of false positives |
epsilon_neg |
expected number of false negatives |
missingness |
probability of data being missing |
Simulate the observation process across a list of observation vectors
Simulate sample COI
simulate_sample_coi(num_samples, mean_coi)
num_samples |
the total number of biological samples to simulate |
mean_coi |
mean multiplicity of infection |
Simulate sample COIs from a zero-truncated Poisson distribution
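A zero-truncated Poisson draw can be sketched in base R by inverting the Poisson CDF on uniform values restricted to lie above P(X = 0). This inverse-CDF approach is an assumption chosen for illustration, and the name rztpois is hypothetical. The package may sample differently.

```r
# Hypothetical sketch using inverse-CDF sampling; not the package implementation.
rztpois <- function(n, lambda) {
  # Uniforms restricted to the part of the CDF above P(X = 0)
  u <- runif(n, min = dpois(0, lambda), max = 1)
  # pmax guards against a floating-point edge case returning 0
  pmax(qpois(u, lambda), 1)
}

set.seed(1)
cois <- rztpois(5000, lambda = 0.5)
```

Every simulated sample therefore has a COI of at least 1, and the mean of the draws is lambda / (1 - exp(-lambda)) rather than lambda itself.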
Simulate sample genotype
simulate_sample_genotype(sample_cois, locus_allele_dist, internal_relatedness)
sample_cois |
Numeric vector indicating the multiplicity of infection for each biological sample |
locus_allele_dist |
Allele frequencies – simplex parameter of a multinomial distribution |
internal_relatedness |
numeric 0-1 indicating the probability for a strain's allele to come from an existing lineage within host |
Simulates sampling the genetics at a single locus given an allele frequency distribution and a vector of sample COIs
A simulated dataset created using simulate_data()
simulated_data
An object of class list of length 9.
Summarize Function of Allele Frequencies
summarize_allele_freq_fn( mcmc_results, fn, lower_quantile = 0.025, upper_quantile = 0.975, merge_chains = TRUE )
mcmc_results |
Result of calling run_mcmc() |
fn |
Function that takes as input a simplex to apply to each allele frequency vector |
lower_quantile |
The lower quantile of the posterior distribution to return |
upper_quantile |
The upper quantile of the posterior distribution to return |
merge_chains |
boolean indicating that all chain results should be merged |
General function to summarize the posterior distribution of functions of the sampled allele frequencies
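The idea behind this kind of summary can be sketched in base R: apply the function to every posterior draw of an allele frequency vector, then report quantiles of the resulting values. The sketch works on a plain matrix of draws for one locus rather than on the object returned by run_mcmc(), and the names summarize_fn, samples, and the output labels are hypothetical.

```r
# Hypothetical sketch operating on a matrix of posterior draws (one per row).
summarize_fn <- function(samples, fn,
                         lower_quantile = 0.025, upper_quantile = 0.975) {
  values <- apply(samples, 1, fn)  # evaluate fn on each sampled simplex
  c(
    lower = unname(quantile(values, lower_quantile)),
    median = median(values),
    upper = unname(quantile(values, upper_quantile)),
    mean = mean(values)
  )
}

# Three toy posterior draws for a two-allele locus
samples <- rbind(c(0.5, 0.5), c(0.9, 0.1), c(0.7, 0.3))

# Posterior summary of expected heterozygosity, 1 - sum(p^2)
he_summary <- summarize_fn(samples, function(p) 1 - sum(p^2))
```

Summarizing the function values draw by draw, instead of applying the function to a point estimate of the frequencies, carries the posterior uncertainty through to the derived quantity.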
Summarize allele frequencies
summarize_allele_freqs( mcmc_results, lower_quantile = 0.025, upper_quantile = 0.975, merge_chains = TRUE )
mcmc_results |
Result of calling run_mcmc() |
lower_quantile |
The lower quantile of the posterior distribution to return |
upper_quantile |
The upper quantile of the posterior distribution to return |
merge_chains |
boolean indicating that all chain results should be merged |
Summarize individual allele frequencies from the posterior distribution of sampled allele frequencies
Summarize COI
summarize_coi( mcmc_results, lower_quantile = 0.025, upper_quantile = 0.975, naive_offset = 2, merge_chains = TRUE )
mcmc_results |
Result of calling run_mcmc() |
lower_quantile |
The lower quantile of the posterior distribution to return |
upper_quantile |
The upper quantile of the posterior distribution to return |
naive_offset |
Offset used in calculate_naive_coi_offset |
merge_chains |
boolean indicating that all chain results should be merged |
Summarize complexity of infection results from MCMC. Returns a dataframe that contains summaries of the posterior distribution of COI for each biological sample, as well as naive estimates of COI.
Summarize effective COI
summarize_effective_coi( mcmc_results, lower_quantile = 0.025, upper_quantile = 0.975, merge_chains = TRUE )
mcmc_results |
Result of calling run_mcmc() |
lower_quantile |
The lower quantile of the posterior distribution to return |
upper_quantile |
The upper quantile of the posterior distribution to return |
merge_chains |
boolean indicating that all chain results should be merged |
Summarize effective COI from MCMC. Returns a dataframe that contains summaries of the posterior distribution of effective COI for each biological sample.
Summarize epsilon_neg
summarize_epsilon_neg( mcmc_results, lower_quantile = 0.025, upper_quantile = 0.975, merge_chains = TRUE )
mcmc_results |
Result of calling run_mcmc() |
lower_quantile |
The lower quantile of the posterior distribution to return |
upper_quantile |
The upper quantile of the posterior distribution to return |
merge_chains |
boolean indicating that all chain results should be merged |
Summarize epsilon negative results from MCMC. Returns a dataframe that contains summaries of the posterior distribution of epsilon negative for each biological sample.
Summarize epsilon_pos
summarize_epsilon_pos( mcmc_results, lower_quantile = 0.025, upper_quantile = 0.975, merge_chains = TRUE )
mcmc_results |
Result of calling run_mcmc() |
lower_quantile |
The lower quantile of the posterior distribution to return |
upper_quantile |
The upper quantile of the posterior distribution to return |
merge_chains |
boolean indicating that all chain results should be merged |
Summarize epsilon positive results from MCMC. Returns a dataframe that contains summaries of the posterior distribution of epsilon positive for each biological sample.
Summarize locus heterozygosity
summarize_he( mcmc_results, lower_quantile = 0.025, upper_quantile = 0.975, merge_chains = TRUE )
mcmc_results |
Result of calling run_mcmc() |
lower_quantile |
The lower quantile of the posterior distribution to return |
upper_quantile |
The upper quantile of the posterior distribution to return |
merge_chains |
Merge the results of multiple chains into a single summary |
Summarize locus heterozygosity from the posterior distribution of sampled allele frequencies.