Welcome to probitlcm’s documentation!¶
The probitlcm package implements the model described in the manuscript:
Eric Alan Wayman, Steven Andrew Culpepper, Jeff Douglas, and Jesse Bowers. “A restricted latent class model with polytomous attributes and respondent-level covariates.” arXiv preprint arXiv:2408.13143, 2024.
If you make use of this software, please cite the above.
Introduction¶
The simulations and data analyses should be run from “run directories” which contain the necessary input files for the run. There are two types of run directories, one for simulations and one for data analyses; they are referred to as run_directory and run_dir_data_analysis, respectively.
Simulation¶
Creating directories and files¶
First, the necessary directories and files must be created, as follows.
Contents of run_directory directory:
file: config_simulation.toml
directory: <sim_info_dir_name>
Example config_simulation.toml:
# My config file
# these should be absolute paths
# current run_directory used is
# laptop: /home/username/run_directory
# cluster: /home/username/run_directory
laptop_process_dir = "/home/username/process_dir"
cluster_process_dir = "/scratch/users/username/process_dir"
number_of_replics = 100
Each simulation has its own directory, <sim_info_dir_name>.
Contents of <sim_info_dir_name>:
directory: json_files
Contents of json_files directory:
01_fixed_vals.json
01_list_higherlevel_datagen_vals.json
01_list_hyperparams_possible_vals.json
01_list_N_vals.json
02_current_tuning_hyperparam_vals.json
02_list_decided_hyperparam_vals.json
Example 01_fixed_vals.json:
{
"T": 1,
"covariates": "age_assignedsex",
"chain_length_after_burnin": 10000,
"burnin": 6000,
"q": 5,
"order": 2,
"omega_0": 0.5,
"omega_1": 0.5,
"sigma_beta_sq": 2.0
}
Example 01_list_higherlevel_datagen_vals.json:
{
"L_k_s": [[2, 2], [3, 3],
[2, 2, 2], [3, 3, 3],
[2, 2, 2, 2]],
"rho": [0, 0.25, 0.5]
}
Example 01_list_hyperparams_possible_vals.json:
{
"sigma_kappa_sq": [0.00048828125, 0.0009765625, 0.001953125, 0.00390625, 0.0078125, 0.015625, 0.03125, 0.0625, 0.125]
}
Example 01_list_N_vals.json:
[500, 1500, 3000]
Example 02_current_tuning_hyperparam_vals.json:
{
"sigma_kappa_sq": 0.125
}
Example 02_list_decided_hyperparam_vals.json:
{
"sigma_kappa_sq": [
[
500,
1500,
3000
],
[
0.00390625,
0.001953125,
0.0009765625
]
]
}
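In this file, each hyperparameter maps to a pair of parallel lists: the first inner list holds sample sizes and the second holds the value decided for the corresponding sample size. A minimal sketch (using only Python's standard json module; the helper function name is hypothetical) of reading such a file into a per-sample-size lookup:

```python
import json

# Parse the parallel-list structure of 02_list_decided_hyperparam_vals.json:
# for each hyperparameter, inner list 1 holds sample sizes and inner
# list 2 holds the decided value for each corresponding sample size.
def decided_value(decided, hyperparam, n):
    sizes, values = decided[hyperparam]
    return values[sizes.index(n)]

decided = json.loads("""
{
    "sigma_kappa_sq": [
        [500, 1500, 3000],
        [0.00390625, 0.001953125, 0.0009765625]
    ]
}
""")

print(decided_value(decided, "sigma_kappa_sq", 1500))  # 0.001953125
```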
Creating data generating parameters and other necessary files¶
Once the above directories and files have been created, the probitlcm_extra package (available at https://github.com/ericwayman01/probitlcm_extra) is used to create the data generating parameters and other necessary files. Install that package, and then continue as follows.
From run_directory, run, for example:
python3.X -m probitlcm_extra.build_situations_list --sim_info_dir sim_for_manuscript
where sim_for_manuscript is an example of <sim_info_dir_name>. As a result, 02_list_situations.json will be placed in <run_directory>/<sim_info_dir_name>/json_files.
Now, from run_directory, for a particular sim_info_dir_name, run the following for situation_num ranging from 1 to the total number of situations (a situation is a particular combination of user-selected non-random values, other than sample size, for a scenario to be simulated; there are 15 situations in the manuscript):
python3.X -m probitlcm_extra.generate_lambda --sim_info_dir_name sim_for_manuscript --situation_num 1
Finally, run the command, for example:
python3.X -m probitlcm_extra.create_scenario_params --sim_info_dir sim_for_manuscript --situation_num 1
for each situation to create the necessary data-generating parameters.
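The two commands above must be issued once per situation. A sketch of building the full command list (a dry run that prints the commands rather than executing them; the directory name and situation count are taken from the manuscript example, and the helper function name is hypothetical):

```python
# Build the per-situation commands described above (dry run: commands
# are printed, not executed). Flag names follow the examples in the text.
SIM_INFO_DIR = "sim_for_manuscript"
NUM_SITUATIONS = 15  # number of situations in the manuscript

def situation_commands(situation_num):
    return [
        ["python3", "-m", "probitlcm_extra.generate_lambda",
         "--sim_info_dir_name", SIM_INFO_DIR,
         "--situation_num", str(situation_num)],
        ["python3", "-m", "probitlcm_extra.create_scenario_params",
         "--sim_info_dir", SIM_INFO_DIR,
         "--situation_num", str(situation_num)],
    ]

for n in range(1, NUM_SITUATIONS + 1):
    for cmd in situation_commands(n):
        print(" ".join(cmd))
```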
Performing hyperparameter tuning¶
Once one has prepared the files for the various scenarios (situations combined with sample sizes), before running simulations, one can perform hyperparameter tuning on a particular scenario. The list of hyperparameters and their possible values should be placed in <run_directory>/<sim_info_dir_name>/json_files/01_list_hyperparams_possible_vals.json. For example:
{
"sigma_kappa_sq": [0.50, 0.75, 1.0, 1.5, 2.0]
}
One runs
python3.X -m probitlcm.hyperparam_tuning --environ laptop --dir_name sim_for_manuscript --type_of_run simulation --scenario_or_setup_number 1
The output will appear in the processing directory, in <process_dir>/<sim_info_dir_name>/hyperparam_tuning/scenario_<scenarionum>_<hyperparamname>_results.txt. Each line of the file has the following format:
<hyperparam value>,<percentage of acceptances/rejections>
Currently, for the tuning, the software uses a “warmup” period of 1000 iterations followed by a period of 2000 iterations during which the data forming the acceptance percentage are collected.
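Since each results line pairs a hyperparameter value with an acceptance percentage, the file can be scanned for the value whose acceptance rate is closest to a desired target. A sketch under stated assumptions: the comma-separated format above, a user-chosen target rate (the default of 0.44 below is a common Metropolis-Hastings heuristic, not a probitlcm recommendation), and a hypothetical helper name:

```python
# Pick the hyperparameter value whose acceptance percentage is closest
# to a target rate, from result lines of the form "<value>,<percentage>".
def best_hyperparam(lines, target=0.44):
    parsed = [tuple(map(float, ln.split(","))) for ln in lines if ln.strip()]
    return min(parsed, key=lambda vp: abs(vp[1] - target))[0]

example = ["0.0625,0.61", "0.125,0.47", "0.25,0.33"]
print(best_hyperparam(example))  # 0.125
```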
Based on these results, the user should manually add the decided-upon value(s) of the hyperparameter(s) to the 02_list_decided_hyperparam_vals.json file. For example:
{
"sigma_kappa_sq": [
[
500,
1500,
3000
],
[
0.00390625,
0.001953125,
0.0009765625
]
]
}
Running a simulation¶
The aforementioned files and directories must have been created in order for one to run a simulation.
A scenario is run in two pieces: first, the scenario_launch function launches all replications for a given scenario. In a laptop environment, the replications run one by one, and control is returned to the user after all of them complete. In a cluster environment, all replications are launched simultaneously and control is returned to the user immediately.
Once the user believes the replications have completed, the user runs the scenario_replics_postprocess function to perform post-processing.
Some example simulation run commands:
python3.X -m probitlcm.scenario_launch --environ laptop --sim_info_dir_name sim_for_manuscript --scenarionumber 1
python3.X -m probitlcm.scenario_replics_postprocess --environ laptop --sim_info_dir_name sim_for_manuscript --scenarionumber 1 --statistic_type mean
Once all scenarios for a simulation have finished running, a report on all scenarios can be built using the build_report_all_scenarios module. The command is:
python3.X -m probitlcm.build_report_all_scenarios --environ laptop --sim_info_dir_name sim_for_manuscript --total_num_of_scenarios 45 --statistic_type median
Data analysis¶
As with a simulation, the run_dir_data_analysis directory must exist and must contain a configuration file as well as a directory for the data set on which one wishes to perform an analysis.
Contents of run directory¶
Contents of run_dir_data_analysis directory:
file: config_data_analysis.toml
directory: <dataset_name>
Example config_data_analysis.toml:
# My config file
# these should be absolute paths
# current run_directory used is
# laptop: /home/username/run_dir_data_analysis
# cluster: /home/username/run_dir_data_analysis
laptop_process_dir = "/home/username/process_dir_data_analysis"
cluster_process_dir = "/scratch/users/username/process_dir_data_analysis"
number_of_chains = 1
Contents of <dataset_name> directory:
01_fixed_vals.json
01_list_hyperparams_possible_vals.json
02_decided_hyperparam_vals.json
covariates.csv
M_j_s.json
responses.csv
setup_0001.json, …, setup_000X.json
Example 01_fixed_vals.json:
{
"T": 1,
"covariates": "age_assignedsex",
"chain_length_after_burnin": 60000,
"burnin": 60000,
"order": 2,
"omega_0": 0.5,
"omega_1": 0.5,
"sigma_beta_sq": 2.0
}
Example 01_list_hyperparams_possible_vals.json:
{
"sigma_kappa_sq": [0.00048828125, 0.0009765625, 0.001953125, 0.00390625, 0.0078125, 0.015625, 0.03125, 0.0625, 0.125]
}
Example 02_decided_hyperparam_vals.json:
{
"sigma_kappa_sq": 0.001953125
}
Example M_j_s.json:
[3, 3, 3, 5, 5, 3, 3, 4, 5, 5, 5, 5, 5, 3, 5, 5, 3]
Example setup_000X.json:
{
"K": 2,
"L_k_s": [
2,
2
]
}
Explanation of contents¶
A dataset may have multiple setups (like scenarios for the simulations).
The .csv files must be comma-separated and should have a header. It is convenient to save such files from R using, for example, the function call write.csv(df, "nan_test.csv", row.names=FALSE) (an equivalent function should exist in Python but has not been tested). The option csv_opts::strict in Armadillo specifies that “missing” values will be converted to the arma::datum::nan type, and this seems to apply to the NA values (really, the strings "NA") placed in those cells by R. Presumably any non-numeric value would work, but such situations have not been tested.
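As a sketch of the untested Python equivalent mentioned above, the standard library's csv module can write a header row and the string "NA" for missing values, matching what R's write.csv produces (the column names and data here are illustrative, and probitlcm's handling of such Python-written files has not been verified):

```python
import csv

# Write a header row and "NA" for missing values, mirroring the output
# of R's write.csv(df, "nan_test.csv", row.names = FALSE).
rows = [{"item1": 0, "item2": 2}, {"item1": None, "item2": 1}]
with open("nan_test.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["item1", "item2"])
    writer.writeheader()
    for row in rows:
        writer.writerow({k: ("NA" if v is None else v) for k, v in row.items()})
```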
The data in responses.csv must be zero-based, i.e. a column j must run from 0 through M_j - 1.
M_j_s.json should consist of a list whose length is the number of items in responses.csv and whose elements indicate the number of possible levels for each item. For example:
[3, 3, 3, 5, 5, 3, 3, 4, 5, 5, 5, 5, 5, 3, 5, 5, 3]
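Since the responses must be zero-based with column j running from 0 through M_j - 1, a quick consistency check of the response data against M_j_s.json can catch accidentally one-based data. A sketch assuming responses already parsed into rows of integers (with any NA cells removed beforehand); the function name is hypothetical:

```python
# Verify each response column j takes values in {0, ..., M_j - 1},
# where m_j_s is the list read from M_j_s.json.
def check_responses(rows, m_j_s):
    assert all(len(r) == len(m_j_s) for r in rows), "column count mismatch"
    for j, m_j in enumerate(m_j_s):
        vals = [r[j] for r in rows]
        assert min(vals) >= 0 and max(vals) <= m_j - 1, f"item {j} out of range"

m_j_s = [3, 3, 4]
check_responses([[0, 2, 3], [2, 1, 0]], m_j_s)  # passes silently
```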
Performing hyperparameter tuning¶
One runs
python3.X -m probitlcm.hyperparam_tuning --environ laptop --dir_name sim_for_manuscript --type_of_run data_analysis --scenario_or_setup_number 1
The content of the file 02_decided_hyperparam_vals.json should be filled in based on the results of hyperparameter tuning.
Running a data analysis¶
An example data analysis run command:
python3.X -m probitlcm.data_analysis_run --environ laptop --dataset rush_cross_sec --setup_num 1
Performing posterior predictive checks¶
Example command:
python3.X -m probitlcm.posterior_predictive_check --environ cluster --dataset rush_cross_sec --setup_num 1 --chainnum 1 --N_files 6
The value for N_files must match the number of files containing draws from the posterior predictive distribution, which were created during the run of the particular setup.
The run of posterior_predictive_check produces the files bartholomew_histogram.png and bartholomew_p_value.txt in the particular chain’s subdirectory.
Seed numbers¶
The seed number used for a given replication of a scenario is computed as:
seed_number = n_replics * (scenario_number - 1) + (replic_num - 1)
where n_replics is the number of replications per scenario.
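As a sketch, the formula yields distinct, reproducible seeds across scenarios and replications (here scenario_number and replic_num are one-based, as the formula implies):

```python
# Seed formula from the documentation: scenario_number and replic_num
# are one-based, so scenario 1 / replication 1 gets seed 0.
def seed_number(n_replics, scenario_number, replic_num):
    return n_replics * (scenario_number - 1) + (replic_num - 1)

print(seed_number(100, 1, 1))  # 0
print(seed_number(100, 2, 1))  # 100
```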
License¶
probitlcm is provided under the GNU General Public License v3, a copy of which can be found in the LICENSE file. By using, distributing, or contributing to this project, you agree to the terms and conditions of this license.
API Documentation¶
- C++ Extension Module
- probitlcm._core.run_replication
- probitlcm._core.run_data_analysis_chain
- probitlcm._core.load_arma_mat_np
- probitlcm._core.load_arma_umat_np
- probitlcm._core.load_arma_cube_np
- probitlcm._core.load_arma_ucube_np
- probitlcm._core.save_arma_uvec_np
- probitlcm._core.save_arma_umat_np
- probitlcm._core.save_arma_mat_np
- probitlcm._core.generate_design_matrix_np
- probitlcm._core.generate_covariates_scenario_np
- probitlcm._core.calculate_basis_vector
- probitlcm._core.get_design_vectors_from_alpha_np
- probitlcm._core.convert_alpha_to_class_numbers
- probitlcm._core.aggregateAllCounts_np
- Python functions