# Usage: Quick Start Guide for CellART

This comprehensive guide provides a detailed explanation of how to preprocess and run CellART, a framework for extracting single-cell information from high-resolution spatial transcriptomics (ST) datasets. 

> **_NOTE:_** You can check the examples of running CellART on [Xenium CRC dataset](https://cellart.readthedocs.io/en/latest/tutorials/xenium_crc.html) and [VisiumHD CRC dataset](https://cellart.readthedocs.io/en/latest/tutorials/visiumhd_crc.html) for practical demonstrations of its usage.


## Preprocessing

### Input Requirements
For CellART, the following inputs are necessary:

1. **Raw Data from Spatial Transcriptomics Platforms**:
   - **Spatial Transcriptomics Data**: Typically provided as a table containing transcript information (gene counts) and their spatial coordinates.
   - **Staining Images**: DAPI or H&E images, which need to be aligned with the transcriptomics data during preprocessing.

2. **scRNA-seq Reference Dataset**:
   - A reference dataset containing cell type annotations paired with the ST data.

### Standardized Input Format
After preprocessing, the raw data is converted into the following standardized format required by CellART:
- **Gene Map**: A 3D array of shape `H×W×G`, where `H` and `W` are the spatial dimensions, and `G` is the number of unique genes. For platforms such as **VisiumHD**, highly variable genes (HVGs) are selected from the scRNA-seq reference. For other platforms, the intersection of genes between ST and scRNA datasets is used. The resolution is typically set to 1 or 2 µm.
- **Nuclei Segmentation Mask**: A 2D array of shape `H×W`, where each pixel is labeled as `0` for background or assigned a unique cell ID. This mask can be generated using tools such as **Cellpose** or **StarDist**, or directly obtained from the raw dataset.
- **Basis Matrix**: A matrix of shape `C×G`, where `C` is the number of cell types and `G` is the number of genes. This matrix describes the gene expression signatures for each cell type and is derived from the scRNA-seq reference.


## Platform-Specific Preprocessing

### Xenium
The following example demonstrates how to preprocess Xenium data for CellART:
```python
from cellart.utils.preprocess import SingleCellPreprocessor, XeniumPreprocessor
from cellart.utils.io import load_list
import scanpy as sc

# Processed data save dir
save_dir = '/{YOUR_PATH}/Xenium/'
# Transcripts and nucleus boundary files in data directory
transcripts_file = "/{YOUR_DATA_PATH}/transcripts.parquet"
nucleus_boundary_10X = "/{YOUR_DATA_PATH}/nucleus_boundaries.parquet"

st_preprocessor = XeniumPreprocessor(transcripts_file, nucleus_boundary_10X, save_dir)

# Annotated scRNA reference path
sc_adata = sc.read("/{YOUR_SC_DATA}/sc_data.h5ad")
# Remember to specific your celltype_col and make sure your are using raw count data
sc_preprocessor = SingleCellPreprocessor(sc_adata, celltype_col = "celltype", save_path= save_dir, st_gene_list=load_list(save_dir + "/st_gene_list.txt"))

# If you are using Xenium 5K, please using hvg options like in VisiumHD
sc_preprocessor.preprocess()

st_preprocessor.prepare_sst(load_list(save_dir + "/filtered_gene_names.txt"))
st_preprocessor.get_nuclei_segmentation()
```


---

### VisiumHD
For VisiumHD, the following example demonstrates how to preprocess data:
```python
import os
# Modify the max image pixels limit for large images
os.environ["OPENCV_IO_MAX_IMAGE_PIXELS"] = pow(2,40).__str__()
import cv2
from cellart.utils.preprocess import SingleCellPreprocessor, VisiumHDPreprocessor
from cellart.utils.io import load_list
import scanpy as sc

# Processed data save dir
save_dir = '/{YOUR_PATH}/VisiumHD/'
# Path to 002um spot data
path = "/{YOUR_DATA_PATH}/square_002um/"
# Path to he
source_image_path = "/{YOUR_DATA_PATH}/XXX_tissue_image.tif" # or btf
# Path to spatial dir
spaceranger_image_path = "/{YOUR_DATA_PATH}/spatial"

st_preprocessor = VisiumHDPreprocessor(path, source_image_path, spaceranger_image_path, save_dir)
st_preprocessor.get_nuclei_segmentation()

# Annotated scRNA reference path
sc_adata = sc.read("/{YOUR_SC_DATA}/sc_data.h5ad")
# Remember to specific your celltype_col and make sure your are using raw count data
sc_preprocessor = SingleCellPreprocessor(sc_adata, celltype_col = "celltype", save_path= save_dir, st_gene_list=load_list(save_dir + "/st_gene_list.txt"))

# Selecting highly variable genes 2000 / 3000
sc_preprocessor.preprocess(hvg_method="seurat_v3", n_hvg=3000)

st_preprocessor.prepare_sst(load_list(save_dir + "/filtered_gene_names.txt"))
```

---

### Stereoseq
For Stereoseq, the following steps illustrate how to preprocess data:
```python
# Stereoseq need to use the package to read the raw data
import stereo

gef_data = stereo.io.read_gef("/{YOUR_DATA_PATH}/XXX.raw.gef", bin_size=4) # 2um resolution
adata = stereo.io.stereo_to_anndata(gef_data)
# You may need additional step to map the ensembl_id to gene name in the reference
adata.write_h5ad("/{YOUR_DATA_PATH}/bin4.h5ad")

from cellart.utils.preprocess import SingleCellPreprocessor
import scanpy as sc
import tifffile
import cv2
import scipy
from scipy.sparse import coo_matrix
import numpy as np
import os
import multiprocessing as mp
import warnings
warnings.filterwarnings("ignore")

img = tifffile.imread("/{YOUR_DATA_PATH}/XXX_ssDNA_regist.tif")
adata = sc.read_h5ad("/{YOUR_DATA_PATH}/bin4.h5ad")

# Processed data save dir
save_dir = '/{YOUR_PATH}/Stereoseq/'


# Match the 2 um resolution
resized = cv2.resize(img, (img.shape[1] // 4, img.shape[0] // 4), interpolation=cv2.INTER_AREA)
# Cellpose to segment the nuclei
def segment_dapi(img, diameter=None, use_cpu=False):
     """Segment nuclei in DAPI image using Cellpose"""
     use_gpu = True if not use_cpu else False
     model = models.Cellpose(gpu=use_gpu, model_type="cyto")
     channels = [0, 0]
     mask, _, _, _ = model.eval(img, diameter=diameter, channels=channels)
     print(mask.max())
     return mask
# Adjust the diameter to get a better cellpose result
nuclei_seg = segment_dapi(resized, diameter=2)
# Save nuclei segmentation
nuclei_seg = np.save("/import/home3/yhchenmath/Dataset/DeconvSeg/stereoseq_mouse_brain/nuclei_seg.npy")

# Annotated scRNA reference path
sc_adata = sc.read("/{YOUR_SC_DATA}/sc_data.h5ad")
# Select genes in sc_adata that are also in st data
common_genes = sc_adata.var_names.intersection(adata.var_names)
sc_preprocessor = SingleCellPreprocessor(sc_adata, celltype_col = "celltype", save_path= save_dir, st_gene_list=common_genes)

sc_preprocessor.preprocess(hvg_method="seurat_v3", n_hvg=3000)

# Process to get gene map
hvg_genes = pd.read_csv("/{YOUR_SAVE_DIR}/filtered_gene_names.txt", header=None)
# Tolist
hvg_genes = hvg_genes[0].tolist()
# Select
adata_hvg = adata[:, hvg_genes].copy()
# Filter 
sc.pp.filter_cells(adata_hvg, min_genes=1)

adata_st = adata_hvg.copy()
# x//4, y//4 to fit the resolution
adata_st.obs['x'] = adata_st.obs['x'] // 4
adata_st.obs['y'] = adata_st.obs['y'] // 4
adata_st.var_names_make_unique()
map_width = img.shape[0] // 4
map_height = img.shape[1] // 4
adata_st.obs_names_make_unique()


def process_gene_chunk(adata_sub, temp_dir, map_height, map_width):
    for gene in list(adata_sub.var_names):
        temp_df = adata_sub.obs[['x', 'y']]
        if scipy.sparse.issparse(adata_sub.X):
            temp_df[gene] = adata_sub[:, gene].X.toarray().flatten()
        else:
            temp_df[gene] = adata_sub[:, gene].X.flatten()
        map = coo_matrix((temp_df[gene], (temp_df['x'], temp_df['y'])), shape=(map_height, map_width)).toarray()
        np.save(os.path.join(temp_dir, gene + ".npy"), map)
        print(f"Gene {gene} processed.")

# Number of multiprocess
n_processes = 25
filtered_gene_list = hvg_genes
gene_names_chunks = np.array_split(filtered_gene_list, n_processes)
processes = []

temp_dir = os.path.join("/PATH/per_gene_map") # A temp dir to store temp data
os.makedirs(temp_dir, exist_ok=True)

for i, gene_chunk in enumerate(gene_names_chunks):
    adata_sub = adata_st[:, gene_chunk].copy()
    p = mp.Process(target=process_gene_chunk, args= (adata_sub, temp_dir, map_height, map_width))
    processes.append(p)
    p.start()

for p in processes:
    p.join()

# Combine channel-wise
from tqdm.notebook import tqdm
gene_map = None
for i, gene in tqdm(enumerate(filtered_gene_list), total=len(filtered_gene_list)):
    map = np.load(os.path.join(temp_dir, gene + ".npy"))
    if gene_map is None:
        gene_map = np.zeros((map.shape[0], map.shape[1], len(filtered_gene_list)), dtype=np.uint8)
        gene_map[:, :, i] = map
    else:
        gene_map[:, :, i] = map

# Save the combined map
np.save("/{YOUR_SAVE_DIR}/gene_map.npy", gene_map)
```
---

### MERFISH


The transcripts.csv file can be convert to Xenium-like format and then processed in the same way (simply rename the column). The image can be matched and use Cellpose to segment like Stereoseq. You can also directly use the nuclei boundaries from platform, details can follow the official tutorials for different versions of MERFISH. 


## Run CellART

After preprocessing (or manually converting raw data into the required format), you will have the following files in your save directory (`save_dir` specified in the preprocessing code):

- **gene_map.npy**: A [HxWxG numpy file], where each element represents the gene count at a specific spatial location.
- **segmentation_mask.npy**: A [HxW numpy file] containing the nuclei segmentation mask.
- **basis.npy**: A [CxG numpy file] representing the cell type signature matrix derived from the scRNA reference.
- **filtered_gene_names.txt**: A text file listing the gene names corresponding to the order of the `G` channel in `gene_map.npy` and `basis.npy`.
- **celltype_names.txt**: A text file listing the cell type names matching the order of the `C` channel in `basis.npy`.

Once the preprocessing is complete, you can run CellART using the following script:

> **_NOTE:_** It is **highly recommended** to register an account with [Weights & Biases (wandb)](https://wandb.ai/) and use it to monitor the training and analysis process in CellART. Wandb provides a powerful platform for tracking experiments

```python
import cellart
from pathlib import Path
import wandb
import os

# Preprocessed data
save_dir = "{YOUR_PREVIOUS_SAVE_DIR}"
# Directory to store all results
log_dir = "/{YOUR_RESULT_LOG_PATH}/cellart_result/"

manager = cellart.ExperimentManager(
    # Basic input data settings (must be specified)
    gene_map=os.path.join(save_dir, "gene_map.npy"),
    nuclei_mask=os.path.join(save_dir, "segmentation_mask.npy"),
    basis=os.path.join(save_dir, "basis.npy"),
    gene_names=os.path.join(save_dir, "filtered_gene_names.txt"),
    celltype_names=os.path.join(save_dir, "celltype_names.txt"),
    log_dir=log_dir,

    # Training parameters (adjust based on convergence and wandb visualization)
    epoch=200, 
    seg_training_epochs=10,
    deconv_warmup_epochs=100,

    pred_period=50,
    gpu="0"
)

# Update options
opt = manager.get_opt()
print(opt)

# Set up wandb for logging and visualization
run = wandb.init(project="CellART", dir=manager.get_log_dir(), config=opt,
                 name=os.path.basename(os.path.normpath(manager.get_log_dir())))

# Set up dataset
dataset = cellart.SSTDataset(manager)
gene_map_shape = dataset.gene_map.shape

# Initialize and train the CellART model
model = cellart.CellARTModel(manager, gene_map_shape, len(dataset.coords_starts))
model.train_model(dataset)
```

---

### How to Select a Proper Epoch Number?

The appropriate epoch number for training the CellART model may vary depending on the complexity of the dataset. Below are general guidelines for selecting the epoch parameters:

1. **Large and Complex Datasets (default setting for most of the datasets in paper)**:
   For datasets with diverse cell types or complex gene expression patterns (like more than 3K genes selected), use higher epoch numbers to ensure sufficient training:
   ```python
   epoch = 400
   seg_training_epochs = 15
   deconv_warmup_epochs = 200
   ```
   Use the `wandb` package to monitor the training process, visualize the annotation results, and evaluate whether they meet expectations. If convergence is achieved earlier than expected, you can terminate training prematurely and check the results.

2. **Simpler Datasets**:
   For datasets with simpler cell type compositions (e.g., fewer subtypes, fewer gene number), the model typically converges faster. In these cases, you can use lower epoch numbers:
   ```python
   epoch = 200
   seg_training_epochs = 10
   deconv_warmup_epochs = 100
   ```


---

### Output

After training, the results will be stored in the `log_dir` directory. Key output files include:

- **epoch_{PRED_EPOCH}/cell_deconv.h5ad**: An [Annotated data](https://anndata.readthedocs.io/en/stable/) file containing segmentation and cell type annotation results.
- **new_segmentation_mask.npy**: A numpy file with the updated segmentation mask, where cell IDs correspond to the `obs_names` in the `cell_deconv.h5ad` file.