scSHARP Usage

scSHARP

class scSHARP.sc_sharp.scSHARP(data_path, tools, marker_path, preds_path=None, neighbors=2, config='2_40.txt', ncells='all', anndata_layer=None, anndata_use_raw=False)

Class for prediction, analysis, and visualization of cell type based on DGE matrix

scSHARP object manages I/O directories, running of component tools, as well as prediction and analysis using scSHARP model.

Attributes:

data_path: path to DGE matrix csv

preds_path: path to component tool output file, csv format

tools: list of component tool string names

marker_path: path to marker gene txt file

neighbors: number of neighbors used for tool consensus (default: 2)

config: config file for the

ncells: number of cells from dataset to use for model prediction

pre_processed: boolean. True when dataset has been preprocessed

component_correlation()

Returns correlation values and heatmap between tool columns

expression_plots(n=5, genes=None)

Generates violin plots of gene expression.

Parameters

n: int

number of highly attributed genes to show

genes: list

list of genes to show

Returns

Plot

get_component_preds(factorized=False)

Returns component predictions if available

heat_map(out_dir=None, n=5)

Displays heat map based on model interpretation

Parameters

att_df: attribute dataframe generated from scSHARP.run_interpretation()

out_dir: optional output directory to save heatmap as pdf (default: None)

n: number of most expressed genes per cell type to display

Returns

ax: matplotlib ax object for heatmap

knn_consensus(k=5)

Returns kNN consensus predictions for unconfidently labeled cells based on the votes of the k nearest confident cells
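The voting idea can be illustrated with a minimal, standard-library sketch. It is not the scSHARP implementation: the function name `knn_consensus_sketch`, the use of -1 to mark unconfident cells, and the Euclidean distance are all assumptions made for illustration, and it expects at least one confident cell.

```python
import math
from collections import Counter


def knn_consensus_sketch(points, labels, k=5):
    """Fill unconfident labels (-1) by majority vote of the k nearest confident cells."""
    # Indices of cells that already carry a confident label.
    confident = [i for i, lab in enumerate(labels) if lab != -1]
    filled = list(labels)
    for i, lab in enumerate(labels):
        if lab != -1:
            continue  # confident cells keep their label
        # Rank confident cells by Euclidean distance to cell i and keep the k closest.
        nearest = sorted(confident, key=lambda j: math.dist(points[i], points[j]))[:k]
        votes = Counter(labels[j] for j in nearest)
        filled[i] = votes.most_common(1)[0][0]
    return filled


# Toy data: two well-separated groups and two unlabeled cells.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (0.05, 0.05), (5.0, 5.1)]
labels = [0, 0, 0, 1, 1, -1, -1]
filled = knn_consensus_sketch(points, labels, k=3)
```

Each unlabeled cell here takes the label held by the majority of its three nearest confident neighbors.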

load_model(file_path)

Load model as serialized object at specified path

model_eval(config, batch_size, neighbors, dropout, random_inits, training_epochs=150)

Evaluates a model for a single hyperparameter configuration

prepare_data(thresh=0.51, normalize=True, scale=True, targetsum=10000.0, run_pca=True, comps=500, cell_fil=0, gene_fil=0)

Prepares dataset for training and prediction

run_interpretation()

Runs gradient-based model interpretation

Note

Interpretation requires a trained model. Model is trained by scSHARP.run_prediction()

Returns

int_df: The interpretation dataframe with rows corresponding with genes and columns corresponding to cell types.

Values indicate the model’s gradient of cell type with respect to the corresponding input gene after absolute value and scaling by cell type

run_prediction(training_epochs=150, thresh=0.51, batch_size=40, seed=8)

Trains GCN model on consensus labels and returns predictions

Parameters

training_epochs: Number of epochs model will be trained on.

For each epoch the model calculates predictions for the entire training dataset, adjusting model weights one or more times.

thresh: voting threshold for component tools (default: 0.51)

batch_size: number of training examples passed through model before calculating gradients (default: 40)

seed: random seed (default: 8)

Returns

Tuple of:

final_preds: predictions on dataset after final training epoch

train_nodes: confident labels used for training

test_nodes: confident labels used for evaluation (masked labels)

keep_cells: cells used in training process, determined during data preprocessing

conf_scores: model confidence values for each prediction
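A hedged usage sketch of the documented call order: it uses only the constructor, run_prediction(), and run_interpretation() signatures listed on this page. The wrapper name `run_scsharp_workflow` is hypothetical, and every path and the tool list are left as arguments for the caller to supply. Whether prepare_data() or run_tools() must be called explicitly beforehand is not stated here, so the sketch assumes component predictions already exist at `preds_path`.

```python
def run_scsharp_workflow(data_path, tools, marker_path, preds_path):
    """Hypothetical helper: train on consensus labels, then interpret the model."""
    # Imported lazily so the helper can be defined without scSHARP installed.
    from scSHARP.sc_sharp import scSHARP

    sharp = scSHARP(data_path, tools, marker_path, preds_path=preds_path)

    # run_prediction returns the five-element tuple described above.
    final_preds, train_nodes, test_nodes, keep_cells, conf_scores = sharp.run_prediction(
        training_epochs=150, thresh=0.51, batch_size=40, seed=8
    )

    # Interpretation requires the model trained by run_prediction.
    int_df = sharp.run_interpretation()
    return final_preds, conf_scores, int_df
```

Running prediction first matters because run_interpretation() requires a trained model.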

run_tools(out_path, ref_path, ref_label_path)

Uses subprocess to run component tools in R.

Parameters

out_path: str

Output path

ref_path: str

Path to reference DGE

ref_label_path: str

Path to labels for reference data set

Returns

bool

True if successful, False if not

save_model(file_path)

Save model as serialized object at specified path

unfactorize_preds()

function that maps preds back to cell types

Utilities

scSHARP.utilities.encode_predictions(df)

encodes predictions for each cell with 1 for each prediction
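One way to read this encoding is as a per-cell vote count over cell types. The sketch below is an illustration only and makes several assumptions: predictions are already integer-coded, -1 means a tool made no prediction, and input is a plain list of rows rather than a pandas DataFrame. `encode_predictions_sketch` and `n_types` are hypothetical names.

```python
def encode_predictions_sketch(rows, n_types):
    """Return, for each cell, a vote-count vector with one slot per cell type."""
    encoded = []
    for preds in rows:
        counts = [0] * n_types
        for p in preds:
            if p >= 0:  # assumption: -1 marks a missing prediction
                counts[p] += 1  # each tool prediction contributes 1
        encoded.append(counts)
    return encoded


# Two cells, three tools each, three possible cell types.
rows = [[0, 0, 1], [2, -1, 2]]
encoded = encode_predictions_sketch(rows, n_types=3)
```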

scSHARP.utilities.factorize_df(df, all_cells)

factorizes all columns in pandas df
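Factorizing here means replacing cell type names with integer codes. The following sketch shows the idea on plain dictionaries and lists instead of a pandas DataFrame. It assumes `all_cells` is the ordered list of known cell type names and that unknown labels map to -1. Both helper names are hypothetical.

```python
def factorize_columns_sketch(columns, all_cells):
    """Map every label in every tool column to its index in all_cells (-1 if absent)."""
    code_of = {name: i for i, name in enumerate(all_cells)}
    return {tool: [code_of.get(p, -1) for p in preds] for tool, preds in columns.items()}


def unfactorize_sketch(codes, all_cells):
    """Map integer codes back to cell type names (None for -1)."""
    return [all_cells[c] if c >= 0 else None for c in codes]


all_cells = ["B cell", "T cell"]
columns = {"tool_a": ["B cell", "T cell"], "tool_b": ["T cell", "unknown"]}
codes = factorize_columns_sketch(columns, all_cells)
names = unfactorize_sketch(codes["tool_b"], all_cells)
```

The second helper mirrors the direction described for unfactorize_preds(), mapping codes back to cell types.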

scSHARP.utilities.filter_scores(scores, thresh=0.5)

filters out score columns with NAs > threshold

scSHARP.utilities.get_consensus_labels(encoded_y, necessary_vote)

Gets the consensus vote of multiple prediction tools. If necessary_vote is < 1, it is taken as the threshold fraction of votes that a label must meet or exceed
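The threshold rule can be sketched as follows. This is an illustration, not the library code: it assumes `encoded_y` holds per-cell vote counts per cell type, that a fractional `necessary_vote` is measured against the votes cast for that cell, and that cells without consensus are labeled -1. The function name is hypothetical.

```python
def consensus_labels_sketch(encoded_y, necessary_vote):
    """Pick the top-voted cell type per cell if it reaches the required vote."""
    labels = []
    for counts in encoded_y:
        total = sum(counts)
        # A value below 1 is read as a fraction of the votes cast for this cell.
        needed = necessary_vote * total if necessary_vote < 1 else necessary_vote
        best = max(range(len(counts)), key=counts.__getitem__)
        labels.append(best if total > 0 and counts[best] >= needed else -1)
    return labels


encoded_y = [[3, 0, 0], [1, 1, 1], [0, 2, 1], [0, 0, 0]]
fraction_labels = consensus_labels_sketch(encoded_y, 0.51)
count_labels = consensus_labels_sketch(encoded_y, 2)
```

With three tools, a threshold of 0.51 and an absolute requirement of 2 votes select the same cells in this toy input.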

scSHARP.utilities.get_max_consensus(votes)

Gets max consensus

scSHARP.utilities.knn_consensus(counts, preds, n_neighbors, converge=False, one_epoch=False)

Do kNN consensus, iterate until x% do not change

scSHARP.utilities.knn_consensus_batch(counts, preds, n_neighbors, converge=False, one_epoch=False, batch_size=1000, keep_conf=False)

Do kNN consensus, iterate until x% do not change

scSHARP.utilities.load_model(file_path, target_types)

loads model from json format

scSHARP.utilities.mask_labels(labels, masking_pct)

masks labels for training

Randomly masks a specified proportion of the labels, replacing their values with -1

Parameters

labels: list of labels

masking_pct: float value for proportion of masked rows

Returns

Tuple of:

labels: original list of labels

masked_labels: copy of original labels, with masking applied
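A minimal sketch of this masking behavior, using only the standard library. The `seed` argument and the function name are hypothetical additions for reproducibility, and rounding the masked count down is an assumption.

```python
import random


def mask_labels_sketch(labels, masking_pct, seed=None):
    """Return (labels, masked_labels), where a random share of labels is set to -1."""
    rng = random.Random(seed)
    n_mask = int(len(labels) * masking_pct)  # assumption: round down
    masked_idx = set(rng.sample(range(len(labels)), n_mask))
    # Build a copy so the original list is left untouched.
    masked = [-1 if i in masked_idx else lab for i, lab in enumerate(labels)]
    return labels, masked


labels = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
original, masked = mask_labels_sketch(labels, 0.3, seed=0)
```

The masked positions can serve as held-out labels for evaluation, while the remaining labels are used for training.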

scSHARP.utilities.pred_accuracy(preds, real)

returns accuracy of predictions

scSHARP.utilities.preprocess(data, normalize=True, scale=False, targetsum=10000.0, run_pca=True, comps=500, cell_fil=0, gene_fil=0)

Preprocesses raw counts DGE matrix

The default parameter values assume filtered, but not normalized DGE counts matrix with rows representing cells and columns representing genes

Parameters

normalize: bool

row norm and lognorm

scale: bool

scale by gene to mean 0 and std 1

targetsum: float

row norm then multiply by target sum

run_pca: bool

Whether or not to run PCA

comps: int

how many components to use for PCA

cell_fil: int

Filter param. Minimum number of cells containing a given gene to be included

gene_fil: int

Filter param. Minimum number of genes a given cell must contain to be included

Returns

preprocessed dataset as an nD-array
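The normalize and scale steps described above can be sketched in plain Python. This illustration assumes "lognorm" means the natural log of (1 + x) and that scaling uses the population standard deviation. It omits PCA and filtering, and both function names are hypothetical.

```python
import math


def normalize_sketch(counts, targetsum=10000.0):
    """Row-normalize each cell to targetsum, then apply log(1 + x)."""
    out = []
    for row in counts:
        total = sum(row)
        out.append([math.log1p(v / total * targetsum) if total else 0.0 for v in row])
    return out


def scale_sketch(matrix):
    """Scale each gene (column) to mean 0 and standard deviation 1."""
    n = len(matrix)
    scaled_cols = []
    for col in zip(*matrix):
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        scaled_cols.append([(v - mean) / std if std else 0.0 for v in col])
    # Transpose back to cells x genes.
    return [list(r) for r in zip(*scaled_cols)]


# Rows are cells, columns are genes, matching the layout assumed above.
counts = [[1, 3], [2, 2], [0, 4]]
normed = normalize_sketch(counts, targetsum=4.0)
scaled = scale_sketch(normed)
```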

scSHARP.utilities.read_marker_file(file_path)

parses marker file

Returns

Tuple of:

markers: list of marker genes

marker_names: list of string gene names

scSHARP.utilities.weighted_encode(df, encoded_y, tool_weights)

More advanced consensus method

df: cells x tools

tool_weights: cell_types x tools

Interpret

scSHARP.interpret.interpret_model(model, X, predictions, genes, batch_size, device, batches=None)

Performs PCA interpretation on model

Parameters

model:

X:

predictions:

genes: gene names for output dataframe labels

batch_size: size of batch for deeplift interpretation

device: torch device for running deeplift computations

batches: Number of batches to run dataset on deeplift. If None, run deeplift on entire dataset (default: None)