scSHARP Usage


class scSHARP.sc_sharp.scSHARP(data_path, tools, marker_path, preds_path=None, neighbors=2, config='2_40.txt', ncells='all', anndata_layer=None, anndata_use_raw=False)

Class for prediction, analysis, and visualization of cell type based on DGE matrix

scSHARP object manages I/O directories, running of component tools, as well as prediction and analysis using scSHARP model.


data_path: path to DGE matrix csv
preds_path: path to component tool output file, in csv format
tools: list of component tool string names
marker_path: path to marker gene txt file
neighbors: number of neighbors used for tool consensus (default: 2)
config: config file for the
ncells: number of cells from dataset to use for model prediction
pre_processed: boolean. True when dataset has been preprocessed


Returns correlation values and heatmap between tool columns

expression_plots(n=5, genes=None)

Generates violin plots of gene expression.



n: number of highly attributed genes to show

genes: list of genes to show




Returns component predictions if available

heat_map(out_dir=None, n=5)

Displays heat map based on model interpretation


att_df: attribute dataframe generated from scSHARP.run_interpretation()
out_dir: optional output directory to save heatmap as pdf (default: None)
n: number of most expressed genes per cell type to display


ax: matplotlib ax object for heatmap


Returns kNN consensus predictions for unconfidently labeled cells, based on the votes of their k nearest confidently labeled cells


Load model as serialized object at specified path

model_eval(config, batch_size, neighbors, dropout, random_inits, training_epochs=150)

Evaluates a model for a single hyperparameter configuration

prepare_data(thresh=0.51, normalize=True, scale=True, targetsum=10000.0, run_pca=True, comps=500, cell_fil=0, gene_fil=0)

Prepares dataset for training and prediction


Runs gradient-based model interpretation


Interpretation requires a trained model. Model is trained by scSHARP.run_prediction()


int_df: The interpretation dataframe with rows corresponding with genes and columns corresponding to cell types.

Values indicate the model’s gradient of cell type with respect to the corresponding input gene after absolute value and scaling by cell type
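The post-processing described above (absolute value, then scaling within each cell type) can be sketched in plain Python. This is only an illustration under stated assumptions: the scaling rule used here (dividing by the largest magnitude in each cell type) is a guess, since the page does not say how scaling is done, and the function name, gene names, and input layout are hypothetical.

```python
def abs_and_scale_by_cell_type(grads):
    """grads: dict mapping cell type -> {gene: raw gradient}.

    Returns the same layout with absolute values rescaled so that the
    largest value within each cell type is 1.0 (assumed scaling rule).
    """
    scaled = {}
    for cell_type, by_gene in grads.items():
        magnitudes = {gene: abs(g) for gene, g in by_gene.items()}
        top = max(magnitudes.values(), default=0.0)
        # Guard against an all-zero column to avoid division by zero.
        scaled[cell_type] = {
            gene: (m / top if top else 0.0) for gene, m in magnitudes.items()
        }
    return scaled


# Hypothetical raw gradients for two genes and two cell types.
raw = {
    "T": {"CD3E": -2.0, "MS4A1": 0.5},
    "B": {"CD3E": 0.1, "MS4A1": -0.4},
}
print(abs_and_scale_by_cell_type(raw))
```

With this rule, the gene with the strongest gradient for a cell type always scores 1.0, so values are comparable within a cell type but not across cell types.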

run_prediction(training_epochs=150, thresh=0.51, batch_size=40, seed=8)

Trains GCN model on consensus labels and returns predictions


training_epochs: Number of epochs model will be trained on.

For each epoch the model calculates predictions for the entire training dataset, adjusting model weights one or more times.

thresh: voting threshold for component tools (default: 0.51)

batch_size: number of training examples passed through model before calculating gradients (default: 40)

seed: random seed (default: 8)


Tuple of:

final_preds: predictions on dataset after final training epoch
train_nodes: confident labels used for training
test_nodes: confident labels used for evaluation (masked labels)
keep_cells: cells used in training process, determined during data preprocessing
conf_scores: model confidence values for each prediction
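A minimal usage sketch that ties the constructor to run_prediction(), using only the signatures documented on this page. The wrapper name annotate_cells is hypothetical, as are any paths and tool names passed to it. The sketch assumes component tool predictions already exist at preds_path, so run_tools() is not called.

```python
def annotate_cells(data_path, tools, marker_path, preds_path):
    """Hypothetical helper: build an scSHARP object and run prediction."""
    # Imported lazily so the helper can be defined without scSHARP installed.
    from scSHARP.sc_sharp import scSHARP

    sharp = scSHARP(
        data_path,
        tools,
        marker_path,
        preds_path=preds_path,
        neighbors=2,
        config="2_40.txt",
        ncells="all",
    )

    # run_prediction returns a 5-tuple; unpack it in the documented order.
    final_preds, train_nodes, test_nodes, keep_cells, conf_scores = (
        sharp.run_prediction(
            training_epochs=150, thresh=0.51, batch_size=40, seed=8
        )
    )
    return final_preds, conf_scores
```

The keyword values shown are the documented defaults, written out so that each one is easy to change.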

run_tools(out_path, ref_path, ref_label_path)

Uses subprocess to run component tools in R.



out_path: output path


ref_path: path to reference DGE


ref_label_path: path to labels for reference data set



True if successful, False if not


Save model as serialized object at specified path


function that maps preds back to cell types



encodes predictions for each cell with 1 for each prediction

scSHARP.utilities.factorize_df(df, all_cells)

factorizes all columns in pandas df

scSHARP.utilities.filter_scores(scores, thresh=0.5)

filters out score columns with NAs > threshold

scSHARP.utilities.get_consensus_labels(encoded_y, necessary_vote)

Gets the consensus vote of multiple prediction tools. If necessary_vote is < 1, it is treated as a threshold proportion that the winning vote share must be greater than or equal to.
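The voting rule can be sketched in plain Python as follows. This is an illustration, not the library's implementation: it takes raw per-cell vote lists rather than the encoded_y matrix, the function name is hypothetical, and using -1 for cells without consensus is an assumption borrowed from the masking convention used elsewhere on this page.

```python
from collections import Counter


def consensus_labels(votes_per_cell, necessary_vote):
    """Return one label per cell, or -1 where no label gets enough votes.

    votes_per_cell: list of lists, one inner list of tool predictions per cell.
    necessary_vote: if < 1, a minimum fraction of tools; otherwise a minimum count.
    """
    labels = []
    for votes in votes_per_cell:
        label, count = Counter(votes).most_common(1)[0]
        if necessary_vote < 1:
            needed = necessary_vote * len(votes)  # fraction of tools
        else:
            needed = necessary_vote  # absolute number of tools
        # -1 marks "no consensus" (assumed convention).
        labels.append(label if count >= needed else -1)
    return labels


votes = [["T", "T", "B"], ["T", "B", "NK"], ["B", "B", "B"]]
print(consensus_labels(votes, 0.51))  # ['T', -1, 'B']
```

With a threshold of 0.51 and three tools, a label needs at least two votes, so the second cell, where every tool disagrees, stays unlabeled.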


Gets max consensus

scSHARP.utilities.knn_consensus(counts, preds, n_neighbors, converge=False, one_epoch=False)

Do kNN consensus, iterate until x% do not change
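A single pass of kNN consensus can be sketched in plain Python. This sketch makes several assumptions: unlabeled cells are marked with -1, distance is Euclidean, at least one cell is already labeled, and only one pass is made instead of iterating until labels stop changing. The function name is hypothetical.

```python
import math
from collections import Counter


def knn_fill(points, labels, n_neighbors):
    """Label each unlabeled point (-1) by majority vote of its nearest labeled points."""
    labeled = [i for i, lab in enumerate(labels) if lab != -1]
    filled = list(labels)
    for i, lab in enumerate(labels):
        if lab != -1:
            continue  # confident labels are kept as they are
        # Rank the labeled points by distance and keep the closest ones.
        nearest = sorted(
            labeled, key=lambda j: math.dist(points[i], points[j])
        )[:n_neighbors]
        filled[i] = Counter(labels[j] for j in nearest).most_common(1)[0][0]
    return filled


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.2, 0.1), (4.9, 5.2)]
labels = ["T", "T", "B", "B", -1, -1]
print(knn_fill(points, labels, n_neighbors=2))  # ['T', 'T', 'B', 'B', 'T', 'B']
```

Only the originally labeled points vote here, so a newly filled label cannot influence other unlabeled cells within the same pass.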

scSHARP.utilities.knn_consensus_batch(counts, preds, n_neighbors, converge=False, one_epoch=False, batch_size=1000, keep_conf=False)

Do kNN consensus, iterate until x% do not change

scSHARP.utilities.load_model(file_path, target_types)

loads model from json format

scSHARP.utilities.mask_labels(labels, masking_pct)

masks labels for training

Randomly masks a specified proportion of the labels, replacing their values with -1


labels: list of labels
masking_pct: float value for proportion of masked rows


Tuple of:

labels: original list of labels
masked_labels: copy of original labels, with masking applied
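The masking behavior described above can be sketched in plain Python. The function name and the seed parameter are assumptions added for reproducibility; they are not part of the documented signature, and the library may choose which rows to mask differently.

```python
import random


def mask_labels_sketch(labels, masking_pct, seed=0):
    """Return (labels, masked_labels) with a random share of labels set to -1."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    n_masked = int(len(labels) * masking_pct)
    masked_idx = set(rng.sample(range(len(labels)), n_masked))
    # Copy the labels, replacing the chosen positions with -1.
    masked_labels = [
        -1 if i in masked_idx else lab for i, lab in enumerate(labels)
    ]
    return labels, masked_labels


original, masked = mask_labels_sketch([0, 1, 2, 1, 0, 2, 1, 0, 2, 1], 0.3)
print(masked.count(-1))  # 3
```

Returning the untouched originals alongside the masked copy lets the hidden labels serve as held-out ground truth for evaluation.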

scSHARP.utilities.pred_accuracy(preds, real)

returns accuracy of predictions
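As a point of reference, accuracy is sketched below as the fraction of positions where the prediction equals the true label. This definition and the function name are assumptions; the page does not say how masked or missing labels are treated.

```python
def accuracy(preds, real):
    """Fraction of predictions that exactly match the true labels."""
    if len(preds) != len(real):
        raise ValueError("preds and real must have the same length")
    matches = sum(p == r for p, r in zip(preds, real))
    return matches / len(real)


print(accuracy(["T", "B", "B", "NK"], ["T", "B", "T", "NK"]))  # 0.75
```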

scSHARP.utilities.preprocess(data, normalize=True, scale=False, targetsum=10000.0, run_pca=True, comps=500, cell_fil=0, gene_fil=0)

Preprocesses raw counts DGE matrix

The default parameter values assume filtered, but not normalized DGE counts matrix with rows representing cells and columns representing genes


normalize: bool

row norm and lognorm

scale: bool

scale by gene to mean 0 and std 1

targetsum: float

row norm then multiply by target sum

run_pca: bool

Whether or not to run PCA

comps: int

how many components to use for PCA

cell_fil: int

Filter param. Minimum number of cells containing a given gene to be included

gene_fil: int

Filter param. Minimum number of genes containing a given cell to be included


preprocessed dataset as an nD-array
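The normalize step ("row norm then multiply by target sum", followed by log normalization) can be sketched in plain Python. The sketch assumes that log normalization means log(1 + x) and that all-zero rows are left as zeros. The function name is hypothetical, and the library's own implementation may differ.

```python
import math


def normalize_rows(counts, targetsum=10000.0):
    """Scale each cell (row) to sum to targetsum, then apply log(1 + x)."""
    normalized = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # An empty cell has nothing to rescale; keep it as zeros.
            normalized.append([0.0] * len(row))
            continue
        normalized.append([math.log1p(v / total * targetsum) for v in row])
    return normalized


# Two cells (rows) by three genes (columns), with a small target sum for readability.
print(normalize_rows([[1, 1, 2], [0, 0, 0]], targetsum=4))
```

Dividing by the row total removes differences in sequencing depth between cells before the log transform compresses the range of expression values.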


parses marker file


Tuple of:

markers: list of marker genes
marker_names: list of string gene names

scSHARP.utilities.weighted_encode(df, encoded_y, tool_weights)

More advanced consensus method

df: cells x tools
tool_weights: cell_types x tools


scSHARP.interpret.interpret_model(model, X, predictions, genes, batch_size, device, batches=None)

Performs PCA interpretation on model


model:
X:
predictions:
genes: gene names for output dataframe labels
batch_size: size of batch for deeplift interpretation
device: torch device for running deeplift computations
batches: number of batches to run dataset on deeplift. If None, run deeplift on entire dataset (default: None)