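Cluster Submission Mode

Introduction

The SDAS Pipelines automated submission tool is a job scheduler that manages the jobs of an SDAS analysis workflow and submits them to a cluster batch system (PBS/Torque by default). It provides:

Automatic dependency resolution: schedules jobs according to the dependencies between them
Concurrency control: limits the number of jobs running at the same time to avoid resource contention
Status monitoring: tracks job execution status in real time
Error handling: automatically retries failed jobs
Detailed reporting: generates a complete execution report and logs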
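
The interaction between dependency resolution and concurrency control can be illustrated with a small sketch. This is not the code of auto_qsub_scheduler.py; the function name select_ready_jobs, its parameters, and the job graph are hypothetical and only show the idea that a job becomes eligible once all of its prerequisites have finished and a running slot is free. Retry handling is left out.

```python
from typing import Dict, List, Set


def select_ready_jobs(
    deps: Dict[str, Set[str]],
    done: Set[str],
    running: Set[str],
    max_running: int,
) -> List[str]:
    """Return the jobs that may be submitted now.

    A job is ready when it has not started yet and every job it depends on
    is in `done`. At most `max_running - len(running)` jobs are returned.
    """
    free_slots = max(0, max_running - len(running))
    ready = [
        job
        for job in sorted(deps)  # sorted only to make the result deterministic
        if job not in done
        and job not in running
        and deps[job] <= done  # all prerequisites have finished
    ]
    return ready[:free_slots]


# Hypothetical job graph, for illustration only.
deps = {
    "cellAnnotation": set(),
    "spatialDomain": set(),
    "cellularNeighborhood": {"cellAnnotation"},
    "DEG": {"spatialDomain"},
}

# Nothing has run yet and two slots are free: only the jobs without
# prerequisites are eligible.
first_batch = select_ready_jobs(deps, done=set(), running=set(), max_running=2)
print(first_batch)
```

A scheduler built this way would call such a function in a loop, moving jobs from running to done as the batch system reports them finished.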

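System Requirements

Python 3.6+
A job scheduling system (any one of the following):
- PBS/Torque
- SGE (Sun Grid Engine)
- Slurm
- LSF (IBM Platform Load Sharing Facility)
Permission to submit to the appropriate queue
A correctly configured SDAS installation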

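Usage

1. Configure the pipeline_input.conf file

Before running the SDAS Pipeline, configure the pipeline_input.conf file. This file defines:

Input data: h5ad file paths and group information
Analysis workflow: the SDAS modules to run
Module parameters: the parameter settings for each module
Dependencies: the input/output relationships between modules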

1.1 Basic configuration structure

# 1. Software path
SDAS_software = /path/to/SDAS

# 2. Input data (choose one of the h5ad_files forms below)
# Single-file input
h5ad_files = /path/to/data.h5ad

# Multi-file input (with group information)
h5ad_files = S1,group1,A.h5ad;S2,group1,B.h5ad;S3,group2,C.h5ad

# Multi-file input (without group information)
h5ad_files = S1,,A.h5ad;S2,,B.h5ad

# 3. Analysis workflow selection
process = coexpress,spatialDomain,cellAnnotation,cellularNeighborhood,CCI,trajectory,DEG,geneSetEnrichment,TF,PPI,spatialRelate
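
The h5ad_files forms shown above can be read as a list of entries separated by semicolons, each entry being either a bare path or three comma-separated fields. The sketch below parses such a value under that reading. Interpreting the first field as a sample name and the second as a group label is an assumption drawn from the examples (S1, group1), and parse_h5ad_files and Sample are hypothetical names, not part of SDAS.

```python
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    name: Optional[str]   # assumed to be the sample name, e.g. "S1"
    group: Optional[str]  # None when the group field is left empty
    path: str


def parse_h5ad_files(value: str) -> List[Sample]:
    """Parse an h5ad_files value into a list of Sample records."""
    samples = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        fields = [field.strip() for field in entry.split(",")]
        if len(fields) == 1:
            # Single-file form: only a path is given.
            samples.append(Sample(None, None, fields[0]))
        elif len(fields) == 3:
            name, group, path = fields
            samples.append(Sample(name, group or None, path))
        else:
            raise ValueError(
                f"expected 'path' or 'name,group,path', got: {entry!r}"
            )
    return samples


for sample in parse_h5ad_files("S1,group1,A.h5ad;S2,,B.h5ad"):
    print(sample)
```

Because commas and semicolons act as separators in this reading, file paths containing either character would not parse correctly.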

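1.2 Module parameter configuration examples

Configuration notes:

Parameter format: parameter_name = parameter_value
Empty value: leave the value blank to use the default
Comments: start with # and describe what a parameter means
Path parameters: use absolute paths to avoid problems with relative paths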
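
As an illustration of the format rules above, the following sketch reads "name = value" lines into a dictionary. It is not the parser used by SDAS_pipeline.py; parse_conf is a hypothetical helper, and two behaviours are assumptions: everything after a # is treated as a comment (so a value cannot contain #), and a later assignment of the same name overrides an earlier one.

```python
from typing import Dict, List


def parse_conf(text: str) -> Dict[str, str]:
    """Parse 'name = value' lines, ignoring comments and blank lines."""
    params = {}
    for raw_line in text.splitlines():
        # Assumption: '#' starts a comment anywhere on the line.
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def names_using_defaults(params: Dict[str, str]) -> List[str]:
    """Names whose value was left blank, meaning the default applies."""
    return [name for name, value in params.items() if value == ""]


sample_conf = """
# Basic parameters
coexpress_method = hotspot  # options: hotspot, nest, hdwgcna
coexpress_bin_size = 100
hotspot_model =
"""

params = parse_conf(sample_conf)
print(params)
print(names_using_defaults(params))
```

All values are kept as strings here; converting them to numbers or paths is left to whichever module consumes the parameter.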

Spatial gene co-expression analysis (coexpress)

# Basic parameters
coexpress_input_process = basic
coexpress_method = hotspot  # options: hotspot, nest, hdwgcna
coexpress_bin_size = 100
coexpress_selected_genes = top5000

# Hotspot parameters
hotspot_fdr_cutoff = 0.05
hotspot_model = normal

Cell type annotation (cellAnnotation)

# Basic parameters
cellAnnotation_input_process = basic
cellAnnotation_method = rctd  # options: cell2location, spotlight, rctd, tangram, scimilarity

# RCTD parameters
rctd_reference = /path/to/reference.h5ad
rctd_label_key = annotation
rctd_bin_size = 100
rctd_input_gene_symbol_key = real_gene_name
rctd_ref_gene_symbol_key = _index
rctd_filter_rare_cell = 100
rctd_n_cpus = 8

Spatial domain identification (spatialDomain)

# Basic parameters
spatialDomain_input_process = basic
spatialDomain_method = graphst

# GraphST parameters
graphst_tool = mclust
graphst_bin_size = 100
graphst_n_clusters = 10
graphst_n_hvg = 3000
graphst_gpu_id = -1

1.3 Module dependency configuration

SDAS modules depend on one another. The *_input_process parameter names the module whose output a given module takes as input:

# Base modules (no dependencies)
coexpress_input_process = basic
spatialDomain_input_process = basic
cellAnnotation_input_process = basic

# Modules that depend on other modules
cellularNeighborhood_input_process = cellAnnotation
CCI_input_process = cellularNeighborhood
trajectory_input_process = cellAnnotation
DEG_input_process = spatialDomain
geneSetEnrichment_input_process = spatialDomain
spatialRelate_input_process = cellAnnotation
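
The dependency settings above form a graph in which each module has at most one upstream module, and a valid run order can be derived from it with a topological sort. The sketch below uses Kahn's algorithm and assumes that "basic" means "no upstream module" and that each *_input_process holds a single module name. execution_order is a hypothetical helper, not a function from SDAS.

```python
from collections import deque
from typing import Dict, List


def execution_order(input_process: Dict[str, str]) -> List[str]:
    """Order modules so that each one comes after its upstream module."""
    children = {module: [] for module in input_process}
    pending = {}  # number of unfinished upstream modules per module
    for module, upstream in input_process.items():
        if upstream == "basic":
            pending[module] = 0
        else:
            if upstream not in input_process:
                raise ValueError(
                    f"{module} depends on {upstream}, which is not configured"
                )
            children[upstream].append(module)
            pending[module] = 1

    queue = deque(module for module, count in pending.items() if count == 0)
    order = []
    while queue:
        module = queue.popleft()
        order.append(module)
        for child in children[module]:
            pending[child] -= 1
            if pending[child] == 0:
                queue.append(child)

    if len(order) != len(input_process):
        raise ValueError("cyclic dependency among modules")
    return order


# The *_input_process values from the configuration above.
input_process = {
    "coexpress": "basic",
    "spatialDomain": "basic",
    "cellAnnotation": "basic",
    "cellularNeighborhood": "cellAnnotation",
    "CCI": "cellularNeighborhood",
    "trajectory": "cellAnnotation",
    "DEG": "spatialDomain",
    "geneSetEnrichment": "spatialDomain",
    "spatialRelate": "cellAnnotation",
}

print(execution_order(input_process))
```

Modules that appear at the same depth of the graph, such as DEG and geneSetEnrichment, do not depend on each other and could run concurrently.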

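2. Generate the job configuration

Once the configuration is complete, run the SDAS Pipeline to generate the job configuration file:

python3 SDAS_pipeline.py -c pipeline_input.conf -o ./output

This generates the all_shell.conf file in the output directory. The file lists every job together with its dependencies.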

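3. Preview the jobs (recommended)

Before actually submitting, preview the jobs in dry-run mode:

python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --dry-run

This shows:

The dependencies among all jobs
Resource requirements (CPU, memory)
The qsub scripts that will be generated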

4. Submit the jobs

After confirming that everything is correct, submit the jobs to the queue:

python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output
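
Steps 2 to 4 share the same output directory, and the -c argument of auto_qsub_scheduler.py must point at the all_shell.conf inside it. A small helper can keep the two in sync. The sketch below only builds the command lines shown above; build_commands is a hypothetical name, and it assumes all_shell.conf is written directly under the output directory, as in the examples.

```python
import os
from typing import List


def build_commands(conf: str, out_dir: str, dry_run: bool = False) -> List[List[str]]:
    """Return the generate and submit commands for one pipeline run."""
    # Assumption: SDAS_pipeline.py writes all_shell.conf into out_dir.
    shell_conf = os.path.join(out_dir, "all_shell.conf")
    generate = ["python3", "SDAS_pipeline.py", "-c", conf, "-o", out_dir]
    submit = ["python3", "auto_qsub_scheduler.py", "-c", shell_conf, "-o", out_dir]
    if dry_run:
        submit.append("--dry-run")
    return [generate, submit]


for command in build_commands("pipeline_input.conf", "./output", dry_run=True):
    print(" ".join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) to execute the step and stop at the first failure.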

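Configuring auto_qsub_scheduler.py for your job scheduler

Depending on your cluster environment, you may need to modify the create_qsub_script method in auto_qsub_scheduler.py to customize the format of the job submission script. The method belongs to the QsubScheduler class:

import os  # required by os.path.basename below

def create_qsub_script(self, shell_file: str, cpu: int, memory: int) -> str:
    """
    Generate the qsub job submission script.
    Args:
        shell_file: path of the shell script to execute
        cpu: number of CPU cores
        memory: memory requirement (GB)
    Returns:
        The content of the generated qsub script
    """
    # Adapt the script template below to your job scheduler
    script = f"""#!/bin/bash
#PBS -q {self.queue}
#PBS -N {os.path.basename(shell_file)}
#PBS -o {shell_file}.log
#PBS -j oe
#PBS -l nodes=1:ppn={cpu}
#PBS -l mem={memory}gb

cd $PBS_O_WORKDIR
bash {shell_file}
"""
    return script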

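You need to:

Adapt the script template to your job scheduler (PBS/Torque, SGE, Slurm, or LSF)
Make sure the required resource settings (CPU, memory, and so on) are included
Keep the references to the following variables:
- self.queue: queue name
- shell_file: path of the script to execute
- cpu: number of CPU cores
- memory: memory requirement (GB)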
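
As one possible adaptation, the sketch below builds an equivalent script for Slurm. It is written as a standalone function so that it can be tried outside the class; create_slurm_script is a hypothetical name, and mapping the queue to a Slurm partition is an assumption. Inside QsubScheduler the same template would use self.queue in place of the queue argument. The sketch covers only the script template. Whether the submission and status-checking commands elsewhere in auto_qsub_scheduler.py also need to change is not described here.

```python
import os


def create_slurm_script(queue: str, shell_file: str, cpu: int, memory: int) -> str:
    """Build an sbatch script mirroring the PBS template (memory in GB)."""
    job_name = os.path.basename(shell_file)
    # With only --output set, Slurm writes stderr to the same file,
    # which matches the effect of "#PBS -j oe".
    return f"""#!/bin/bash
#SBATCH --partition={queue}
#SBATCH --job-name={job_name}
#SBATCH --output={shell_file}.log
#SBATCH --nodes=1
#SBATCH --cpus-per-task={cpu}
#SBATCH --mem={memory}G

cd "$SLURM_SUBMIT_DIR"
bash {shell_file}
"""


# Hypothetical queue name and script path, for illustration only.
print(create_slurm_script("normal", "/work/output/run_DEG.sh", cpu=8, memory=32))
```

The four values named in the list above (queue, script path, CPU count, memory) all appear in the generated script, which is the property to preserve when writing templates for SGE or LSF as well.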

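Test Data and Configuration Files

SDAS Pipelines ships single-slice and multi-slice test data with matching configuration files, so that users can get started and test the system quickly.

Directory structure

SDAS_download/
├── Scripts/
│   └── pipeline_cluster/
│       ├── auto_qsub_scheduler.py      # Automated submission script
│       ├── SDAS_pipeline.py            # Pipeline generation script
│       ├── pipeline_input.single_slice.conf   # Example configuration for single-slice data
│       └── pipeline_input.multiple_slice.conf  # Example configuration for multi-slice data
└── Test_data/
    ├── single_slice/     # Single-slice test data
    │   └── sample.h5ad
    └── multiple_slices/  # Multi-slice test data
        ├── P19_NT_transition.h5ad
        ├── P19_T_transition.h5ad
        ├── P34_NT_transition.h5ad
        ├── P34_T_transition.h5ad
        ├── P33_T_transition.h5ad
        └── P36_T_transition.h5ad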

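Single-slice analysis configuration

pipeline_input.single_slice.conf defines the analysis workflow for a single spatial transcriptomics slice:

Input data: a single h5ad file
Analysis modules: most SDAS analysis modules
Features:
- Simple input data configuration
- Complete module parameter examples
- Suitable for first-time users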

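Multi-slice analysis configuration

pipeline_input.multiple_slice.conf defines the analysis workflow for multiple spatial transcriptomics slices:

Input data: multiple h5ad files with group information (for example Normal/Tumor)
Analysis modules: chosen to fit the experimental design
Features:
- Demonstrates the multi-sample input format
- Includes parameter settings for between-group comparison
- Suitable for case-control analysis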

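Test Steps

1. Single-slice data test

Step 1: Prepare the configuration file

# 1. Copy the configuration file to the working directory
cp Scripts/pipeline_cluster/pipeline_input.single_slice.conf ./

# 2. Edit the paths in the configuration file
# - SDAS_software path
# - h5ad_files path
# - Reference data paths (if needed)

Step 2: Generate the job configuration

# Run the Pipeline to generate the job configuration
python3 Scripts/pipeline_cluster/SDAS_pipeline.py -c pipeline_input.single_slice.conf -o ./output_single_slice

Step 3: Preview the jobs (recommended)

# Preview the job configuration in dry-run mode
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_single_slice/all_shell.conf -o ./output_single_slice --dry-run

Step 4: Submit the jobs

# Submit the jobs to the queue
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_single_slice/all_shell.conf -o ./output_single_slice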

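2. Multi-slice data test

Step 1: Prepare the configuration file

# 1. Copy the configuration file to the working directory
cp Scripts/pipeline_cluster/pipeline_input.multiple_slice.conf ./

# 2. Edit the paths in the configuration file
# - SDAS_software path
# - h5ad_files paths (multiple files)
# - Reference data paths (if needed)

Step 2: Generate the job configuration

# Run the Pipeline to generate the job configuration
python3 Scripts/pipeline_cluster/SDAS_pipeline.py -c pipeline_input.multiple_slice.conf -o ./output_multiple_slice

Step 3: Preview the jobs (recommended)

# Preview the job configuration in dry-run mode
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_multiple_slice/all_shell.conf -o ./output_multiple_slice --dry-run

Step 4: Submit the jobs

# Submit the jobs to the queue
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_multiple_slice/all_shell.conf -o ./output_multiple_slice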
