scSHARP Usage¶
scSHARP¶
- class scSHARP.sc_sharp.scSHARP(data_path, tools, marker_path, preds_path=None, neighbors=2, config='2_40.txt', ncells='all', anndata_layer=None, anndata_use_raw=False)¶
Class for prediction, analysis, and visualization of cell type based on DGE matrix
The scSHARP object manages I/O directories, runs the component tools, and performs prediction and analysis using the scSHARP model.
Attributes:¶
- data_path: path to DGE matrix CSV
- preds_path: path to component tool output file (CSV format)
- tools: list of component tool string names
- marker_path: path to marker gene txt file
- neighbors: number of neighbors used for tool consensus (default: 2)
- config: config file (default: '2_40.txt')
- ncells: number of cells from dataset to use for model prediction
- pre_processed: boolean. True when dataset has been preprocessed
- component_correlation()¶
Returns correlation values and heatmap between tool columns
- expression_plots(n=5, genes=None)¶
Generates violin plots of gene expression.
Parameters¶
- n: int
number of highly attributed genes to show
- genes: list
list of genes to show
Returns¶
Plot
- get_component_preds(factorized=False)¶
Returns component predictions if available
- heat_map(out_dir=None, n=5)¶
Displays heat map based on model interpretation
Parameters¶
- att_df: attribute dataframe generated from scSHARP.run_interpretation()
- out_dir: optional output directory to save heatmap as PDF (default: None)
- n: number of most expressed genes per cell type to display
Returns¶
ax: matplotlib ax object for heatmap
- knn_consensus(k=5)¶
Returns kNN consensus predictions for unconfidently labeled cells, based on the votes of the k nearest confidently labeled cells
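The idea behind kNN consensus can be illustrated with a minimal, standard-library sketch. It is not the scSHARP implementation: the function name `knn_consensus_sketch`, the use of `-1` to mark unconfident cells, Euclidean distance, and the single (non-iterative) pass are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_consensus_sketch(points, labels, k=5):
    """Assign each unconfident cell (-1) the majority label of its
    k nearest confidently labeled cells. Hypothetical helper."""
    # Cells that already carry a confident label act as voters.
    confident = [(p, lab) for p, lab in zip(points, labels) if lab != -1]
    result = list(labels)
    for i, (point, label) in enumerate(zip(points, labels)):
        if label != -1 or not confident:
            continue  # keep confident labels; nothing to vote with otherwise
        nearest = sorted(confident, key=lambda c: math.dist(point, c[0]))[:k]
        votes = Counter(lab for _, lab in nearest)
        result[i] = votes.most_common(1)[0][0]
    return result


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (0.5, 0.5), (10.5, 10.5)]
labels = [0, 0, 0, 1, 1, 1, -1, -1]
filled = knn_consensus_sketch(points, labels, k=3)
```

Here the last two cells start without a confident label and receive the label of the cluster they sit in.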
- load_model(file_path)¶
Load model as serialized object at specified path
- model_eval(config, batch_size, neighbors, dropout, random_inits, training_epochs=150)¶
Evaluates a model for a single hyperparameter configuration
- prepare_data(thresh=0.51, normalize=True, scale=True, targetsum=10000.0, run_pca=True, comps=500, cell_fil=0, gene_fil=0)¶
Prepares dataset for training and prediction
- run_interpretation()¶
Runs gradient-based model interpretation
Note¶
Interpretation requires a trained model. Model is trained by scSHARP.run_prediction()
Returns¶
- int_df: The interpretation dataframe with rows corresponding with genes and columns corresponding to cell types.
Values indicate the model’s gradient of cell type with respect to the corresponding input gene after absolute value and scaling by cell type
- run_prediction(training_epochs=150, thresh=0.51, batch_size=40, seed=8)¶
Trains GCN model on consensus labels and returns predictions
Parameters¶
- training_epochs: Number of epochs model will be trained on.
For each epoch the model calculates predictions for the entire training dataset, adjusting model weights one or more times.
- thresh: voting threshold for component tools (default: 0.51)
- batch_size: number of training examples passed through model before calculating gradients (default: 40)
- seed: random seed (default: 8)
Returns¶
- Tuple of:
final_preds: predictions on dataset after final training epoch
train_nodes: confident labels used for training
test_nodes: confident labels used for evaluation (masked labels)
keep_cells: cells used in training process, determined during data preprocessing
conf_scores: model confidence values for each prediction
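A possible way to chain the constructor, run_prediction(), and run_interpretation() is sketched below, using only the signatures documented on this page. The wrapper name `run_scsharp_workflow` is hypothetical, the paths and tool names are left to the caller, and the sketch assumes the component tool predictions already exist at `preds_path`. The import is deferred into the function so the module loads even where scSHARP is not installed.

```python
def run_scsharp_workflow(data_path, tools, marker_path, preds_path):
    """Hypothetical wrapper: train on consensus labels, then interpret."""
    # Deferred import: only needed when the workflow is actually run.
    from scSHARP.sc_sharp import scSHARP

    sharp = scSHARP(
        data_path,
        tools,
        marker_path,
        preds_path=preds_path,
        neighbors=2,
        config="2_40.txt",
    )

    # Unpack the five-element tuple documented for run_prediction().
    final_preds, train_nodes, test_nodes, keep_cells, conf_scores = (
        sharp.run_prediction(
            training_epochs=150, thresh=0.51, batch_size=40, seed=8
        )
    )

    # Interpretation requires the model trained by run_prediction().
    int_df = sharp.run_interpretation()

    return {
        "final_preds": final_preds,
        "train_nodes": train_nodes,
        "test_nodes": test_nodes,
        "keep_cells": keep_cells,
        "conf_scores": conf_scores,
        "int_df": int_df,
    }
```

Calling run_interpretation() only after run_prediction() follows the note in the run_interpretation() entry.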
- run_tools(out_path, ref_path, ref_label_path)¶
Uses subprocess to run component tools in R.
Parameters¶
- out_path: str
Output path
- ref_path: str
Path to reference DGE
- ref_label_path: str
Path to labels for reference data set
Returns¶
- bool
True if successful, False otherwise
- save_model(file_path)¶
Save model as serialized object at specified path
- unfactorize_preds()¶
function that maps preds back to cell types
Utilities¶
- scSHARP.utilities.encode_predictions(df)¶
encodes predictions for each cell with 1 for each prediction
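As an illustration of this kind of encoding, the sketch below counts one vote per tool prediction for each cell. It is an assumption-laden stand-in rather than the library code: the name `encode_votes`, the plain-list input (one row per cell, one factorized integer label per tool), and the convention that `-1` means "no prediction" are all hypothetical.

```python
def encode_votes(pred_rows, n_types):
    """Return a cells x cell_types matrix of vote counts.

    pred_rows: one list per cell with an integer label per tool;
    labels outside 0..n_types-1 (e.g. -1) are ignored.
    """
    encoded = []
    for row in pred_rows:
        counts = [0] * n_types
        for pred in row:
            if 0 <= pred < n_types:
                counts[pred] += 1  # one vote per tool prediction
        encoded.append(counts)
    return encoded


votes = encode_votes([[0, 0, 1], [2, -1, 2]], n_types=3)
```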
- scSHARP.utilities.factorize_df(df, all_cells)¶
factorizes all columns in pandas df
- scSHARP.utilities.filter_scores(scores, thresh=0.5)¶
filters out score columns with NAs > threshold
- scSHARP.utilities.get_consensus_labels(encoded_y, necessary_vote)¶
Gets the consensus vote of multiple prediction tools. If necessary_vote is < 1, it is treated as a threshold percentage that the winning vote share must meet or exceed
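The thresholding rule can be sketched as follows. This is an illustrative reading, not the library's code: the function name `consensus_labels_sketch`, the vote-count input layout (cells x cell types), the `-1` label for cells without consensus, and the handling of ties are assumptions.

```python
def consensus_labels_sketch(vote_counts, necessary_vote):
    """Pick the winning cell type per cell, or -1 when no consensus.

    necessary_vote < 1 is read as a fraction of the votes cast for the
    cell; otherwise it is read as an absolute number of votes.
    """
    labels = []
    for row in vote_counts:
        total = sum(row)
        if necessary_vote < 1:
            needed = necessary_vote * total
        else:
            needed = necessary_vote
        best = max(row) if row else 0
        # Require at least one vote, a unique winner, and enough votes.
        if total > 0 and best >= needed and row.count(best) == 1:
            labels.append(row.index(best))
        else:
            labels.append(-1)
    return labels


votes = [[3, 0], [1, 1], [2, 1], [0, 0]]
by_fraction = consensus_labels_sketch(votes, 0.51)
by_count = consensus_labels_sketch(votes, 3)
```

With a fractional threshold of 0.51, a 2-to-1 split still yields a label, while the absolute threshold of 3 demands unanimity among three tools.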
- scSHARP.utilities.get_max_consensus(votes)¶
Gets max consensus
- scSHARP.utilities.knn_consensus(counts, preds, n_neighbors, converge=False, one_epoch=False)¶
Do kNN consensus, iterate until x% do not change
- scSHARP.utilities.knn_consensus_batch(counts, preds, n_neighbors, converge=False, one_epoch=False, batch_size=1000, keep_conf=False)¶
Do kNN consensus, iterate until x% do not change
- scSHARP.utilities.load_model(file_path, target_types)¶
loads model from json format
- scSHARP.utilities.mask_labels(labels, masking_pct)¶
masks labels for training
Randomly masks a specified portion of the labels, substituting their value for -1
Parameters¶
- labels: list of labels
- masking_pct: float value for proportion of masked rows
Returns¶
- Tuple of:
labels: original list of labels
masked_labels: copy of original labels, with masking applied
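A minimal sketch of this masking behavior, using only the standard library, is shown below. The name `mask_labels_sketch`, the `seed` parameter, and the choice to round the masked count down are assumptions for illustration and may differ from the library.

```python
import random


def mask_labels_sketch(labels, masking_pct, seed=0):
    """Return (labels, masked_labels) with a random share set to -1."""
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    n_mask = int(len(labels) * masking_pct)
    masked_idx = set(rng.sample(range(len(labels)), n_mask))
    masked = [-1 if i in masked_idx else lab for i, lab in enumerate(labels)]
    return labels, masked


original, masked = mask_labels_sketch([0, 1, 2, 1, 0, 2, 1, 0, 2, 1], 0.3)
```

The original list is returned untouched, so the masked positions can later be compared against their true labels for evaluation.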
- scSHARP.utilities.pred_accuracy(preds, real)¶
returns accuracy of predictions
- scSHARP.utilities.preprocess(data, normalize=True, scale=False, targetsum=10000.0, run_pca=True, comps=500, cell_fil=0, gene_fil=0)¶
Preprocesses raw counts DGE matrix
The default parameter values assume a filtered but not yet normalized DGE counts matrix, with rows representing cells and columns representing genes
Parameters¶
- normalize: bool
row norm and lognorm
- scale: bool
scale by gene to mean 0 and std 1
- targetsum: float
row norm then multiply by target sum
- run_pca: bool
Whether or not to run PCA
- comps: int
how many components to use for PCA
- cell_fil: int
Filter param. Minimum number of cells in which a given gene must be present for the gene to be included
- gene_fil: int
Filter param. Minimum number of genes that must be present in a given cell for the cell to be included
Returns¶
preprocessed dataset as an nD-array
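The normalize and scale steps described above can be sketched in plain Python. This is not the library's implementation: the helper names are hypothetical, `log1p` is assumed for the log normalization, population standard deviation is assumed for scaling, and filtering and PCA are left out.

```python
import math


def normalize_rows(counts, targetsum=10000.0):
    """Scale each cell (row) to sum to targetsum, then apply log1p."""
    out = []
    for row in counts:
        total = sum(row)
        factor = targetsum / total if total > 0 else 0.0
        out.append([math.log1p(v * factor) for v in row])
    return out


def scale_columns(matrix):
    """Scale each gene (column) to mean 0 and standard deviation 1."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    scaled = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [matrix[i][j] for i in range(n_rows)]
        mean = sum(col) / n_rows
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n_rows)
        for i in range(n_rows):
            # Constant columns are set to 0 to avoid dividing by zero.
            scaled[i][j] = (col[i] - mean) / std if std > 0 else 0.0
    return scaled


counts = [[1, 3], [2, 2], [0, 4]]  # rows are cells, columns are genes
normalized = normalize_rows(counts)
scaled = scale_columns(normalized)
```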
- scSHARP.utilities.read_marker_file(file_path)¶
parses marker file
Returns¶
- Tuple of:
markers: list of marker genes
marker_names: list of string gene names
- scSHARP.utilities.weighted_encode(df, encoded_y, tool_weights)¶
More advanced consensus method
df: cells x tools
tool_weights: cell_types x tools
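One plausible reading of the shapes above is that each tool's vote for a cell type is weighted by that tool's weight for that type. The sketch below illustrates this reading only. The name `weighted_votes`, the plain-list inputs, and the omission of the `encoded_y` argument are assumptions, so the actual function may combine its inputs differently.

```python
def weighted_votes(pred_rows, tool_weights):
    """Return cells x cell_types scores from weighted tool votes.

    pred_rows: cells x tools, integer cell type per tool (-1 = none).
    tool_weights: cell_types x tools, weight of each tool per type.
    """
    n_types = len(tool_weights)
    encoded = []
    for row in pred_rows:
        scores = [0.0] * n_types
        for tool_idx, pred in enumerate(row):
            if 0 <= pred < n_types:
                # Add this tool's weight for the type it voted for.
                scores[pred] += tool_weights[pred][tool_idx]
        encoded.append(scores)
    return encoded


tool_weights = [[1.0, 0.5], [0.2, 2.0]]  # 2 cell types x 2 tools
scores = weighted_votes([[0, 1], [1, 1], [0, -1]], tool_weights)
```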
Interpret¶
- scSHARP.interpret.interpret_model(model, X, predictions, genes, batch_size, device, batches=None)¶
Performs PCA interpretation on model
Parameters¶
- model:
- X:
- predictions:
- genes: gene names for output dataframe labels
- batch_size: size of batch for deeplift interpretation
- device: torch device for running deeplift computations
- batches: Number of batches to run dataset on deeplift. If None, run deeplift on entire dataset (default: None)