Running and Scaling Cluster Jobs

Day 4 Objectives

  • How do you structure and document reproducible, scalable work on a cluster?

  • How do you rerun your pipeline when new data arrives without repeating completed work?

  • How do you document your process and retrieve results back to your local machine?

On Day 4, we’ll build up from a simple data extraction task to more scalable cluster workflows. It’s also a good moment to reinforce the project organization practices you should carry into your own research.

Example Project Directory

Here is the directory structure for this project:

exercises/
│
├── data/
│   └── form_3_10.csv        # Input file to process
│
├── results/                 # Processed outputs go here
│   └── parsed_form3.json
│
├── scripts/
│   ├── extract_form_3_one_file.py
│   ├── extract_form_3_batch.py
│   ├── extract_form_3_batch_checkpoint.py
│   └── extract_form_3_onefile_array.py
│
├── slurm/
│   ├── extract_form_3_one_file.slurm
│   ├── extract_form_3_batch.slurm
│   ├── extract_form_3_batch_checkpoint.slurm
│   ├── extract_form_3_array.slurm
│   │
│   └── logs/                      # Slurm logs directory
│       ├── extract-one-file-758543.out
│       ├── extract-batch-checkpoint-758547.out
│       └── extract-form-3-758549_0.out 
│
├── venv/                  # Virtual environment
├── requirements.txt       # Python dependencies
└── README.md              # Short project documentation
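
To make the environment reproducible, create the virtual environment and install pinned dependencies before running anything. A minimal setup sketch, assuming Python 3 is available on the login node (adjust to whatever your cluster provides):

# Create and activate the virtual environment, then install dependencies.
cd exercises/
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt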

Single File Processing for Testing and Debugging

  • Script: extract_form_3_one_file.py
  • Slurm job: extract_form_3_one_file.slurm
  • Processes one file.
  • Great for testing your code and debugging errors; a sketch of the Slurm script follows this list.
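
A minimal sketch of what extract_form_3_one_file.slurm could look like. The resource requests (time, memory) are illustrative assumptions, not the actual course script:

#!/bin/bash
#SBATCH --job-name=extract-one-file
#SBATCH --output=slurm/logs/extract-one-file-%j.out   # %j = job ID
#SBATCH --time=00:10:00                               # assumed time limit
#SBATCH --mem=2G                                      # assumed memory request

# Activate the project environment, then process a single input file.
source venv/bin/activate
python scripts/extract_form_3_one_file.py data/form_3_10.csv

Submit it from the exercises/ directory with sbatch slurm/extract_form_3_one_file.slurm; the log lands in slurm/logs/.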

Exercise:

  • Discuss what happens if you need to process 1,000 files.

Sequential Processing: One Job Handles Many Files

  • Script: extract_form_3_batch.py
  • Slurm job: extract_form_3_batch.slurm
  • Loops over multiple files inside one Python job (sketched below).
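
A sketch of the batch script’s core loop. The parse_form_3 helper and the data/*.csv glob are illustrative assumptions:

# extract_form_3_batch.py (sketch): process every input CSV in one job.
import json
from pathlib import Path

def parse_form_3(path):
    # Placeholder parser; the real extraction logic lives in the course script.
    with path.open() as f:
        return {"source": path.name, "n_lines": sum(1 for _ in f)}

for csv_file in sorted(Path("data").glob("*.csv")):
    record = parse_form_3(csv_file)
    (Path("results") / f"{csv_file.stem}.json").write_text(json.dumps(record))
    print(f"Processed {csv_file.name}", flush=True)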

Exercise:

  • Discuss what happens if the job fails partway through.

Checkpointing for Fault Tolerance

  • Script: extract_form_3_batch_checkpoint.py
  • Slurm job: extract_form_3_batch_checkpoint.slurm
  • Adds checkpointing logic: tracks completed files and resumes on failure.
  • A common pattern for long-running research jobs; the core logic is sketched below.
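
The core idea: record each file as it completes, and skip recorded files on restart. A minimal sketch; the results/completed.txt checkpoint file name is an assumption:

# Checkpointing sketch: skip files already recorded in the checkpoint list.
import json
from pathlib import Path

checkpoint = Path("results/completed.txt")
done = set(checkpoint.read_text().splitlines()) if checkpoint.exists() else set()

for csv_file in sorted(Path("data").glob("*.csv")):
    if csv_file.name in done:
        continue  # finished in a previous run
    record = {"source": csv_file.name}  # real parsing goes here
    (Path("results") / f"{csv_file.stem}.json").write_text(json.dumps(record))
    with checkpoint.open("a") as f:     # append only after a successful write
        f.write(csv_file.name + "\n")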

Exercise:

  • Simulate a failure by forcing the job to stop after a few files.
  • Restart the job and verify it skips already completed files.

Parallel Processing with Slurm Job Arrays

  • Script: extract_form_3_onefile_array.py
  • Slurm job: extract_form_3_array.slurm
  • Uses Slurm job arrays to process files independently and in parallel.
  • Highly efficient for large datasets; see the example script after this list.
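
A sketch of the array job script: each task maps its SLURM_ARRAY_TASK_ID to one input file. The 0-9 range and the argument-passing convention are assumptions (the course script may instead read the task ID itself):

#!/bin/bash
#SBATCH --job-name=extract-form-3
#SBATCH --output=slurm/logs/extract-form-3-%A_%a.out  # %A = array job ID, %a = task index
#SBATCH --array=0-9                                   # assumed: 10 input files, one per task
#SBATCH --time=00:10:00

# Map this task's index to one input file and process only that file.
source venv/bin/activate
FILES=(data/*.csv)
python scripts/extract_form_3_onefile_array.py "${FILES[$SLURM_ARRAY_TASK_ID]}"

Each task writes its own log (extract-form-3-<jobid>_<task>.out), matching the files shown in slurm/logs/ above.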

Exercise:

  • Discuss how to aggregate results (one approach is sketched after this list).
  • Discuss the limitations of job arrays on Yen-Slurm.
  • Discuss the limitations of the shared file system on the Yens.
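
For the aggregation question, one simple approach is a short post-processing script that merges the per-file JSON outputs once all array tasks finish (the combined.json name is an assumption):

# Aggregation sketch: merge per-file JSON results into one combined file.
import json
from pathlib import Path

records = [
    json.loads(p.read_text())
    for p in sorted(Path("results").glob("*.json"))
    if p.name != "combined.json"  # don't re-read our own output on reruns
]
Path("results/combined.json").write_text(json.dumps(records, indent=2))
print(f"Combined {len(records)} result files.")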

Copy Results and Document Your Work

Exercise:

  • Use scp to copy your results directory back to your laptop (see the example after this list).
  • Write a short README.md describing:

    • What the pipeline does
    • How it runs (Slurm + Python)
    • Where the results go
    • How to rerun it with new data
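
An example of the copy step, run from your laptop rather than from the cluster; the username, hostname, and remote path are placeholders:

# Recursively copy the cluster's results/ directory to the current local directory.
scp -r your_sunetid@yen.stanford.edu:~/exercises/results/ ./results/

A few sentences under each of the README headings above is enough; the goal is that a collaborator (or future you) can rerun the pipeline without asking questions.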