Running and Scaling Cluster Jobs
Day 4 Objectives
- How do you structure and document reproducible, scalable work on a cluster?
- How do you rerun your pipeline when new data arrives without repeating completed work?
- How do you document your process and retrieve results back to your local machine?
On Day 4, we’ll build up from a simple data extraction task to more scalable cluster workflows. This is also a good opportunity to reinforce the project organization practices you should follow in your own research.
Example Project Directory
Here is a directory structure for this project:
```
exercises/
│
├── data/
│   └── form_3_10.csv                         # Input file to process
│
├── results/                                  # Processed outputs go here
│   └── parsed_form3.json
│
├── scripts/
│   ├── extract_form_3_one_file.py
│   ├── extract_form_3_batch.py
│   ├── extract_form_3_batch_checkpoint.py
│   └── extract_form_3_onefile_array.py
│
├── slurm/
│   ├── extract_form_3_one_file.slurm
│   ├── extract_form_3_batch.slurm
│   ├── extract_form_3_batch_checkpoint.slurm
│   ├── extract_form_3_array.slurm
│   │
│   └── logs/                                 # Slurm logs directory
│       ├── extract-one-file-758543.out
│       ├── extract-batch-checkpoint-758547.out
│       └── extract-form-3-758549_0.out
│
├── venv/                                     # Virtual environment
├── requirements.txt                          # Python dependencies
└── README.md                                 # Short project documentation
```
Single File Processing for Testing and Debugging
- Script: `extract_form_3_one_file.py`
- Slurm job: `extract_form_3_one_file.slurm` (sketched below)
- Processes one file.
- Great for testing your code and debugging errors.
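For reference, a minimal job script for this step might look like the sketch below. The resource requests, paths, and script arguments are illustrative assumptions, not the contents of the actual course file:

```bash
#!/bin/bash
#SBATCH --job-name=extract-one-file
#SBATCH --output=slurm/logs/extract-one-file-%j.out  # %j expands to the job ID
#SBATCH --time=00:10:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

# Activate the project's virtual environment, then process a single file.
# The input/output arguments here are assumed for illustration.
source venv/bin/activate
python scripts/extract_form_3_one_file.py data/form_3_10.csv results/parsed_form3.json
```

Submit it from the project root with `sbatch slurm/extract_form_3_one_file.slurm`, then check the log it writes under `slurm/logs/`.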
Exercise:
- Discuss what happens if you need to process 1,000 files.
Sequential Processing: One Job Handles Many Files
- Script: `extract_form_3_batch.py`
- Slurm job: `extract_form_3_batch.slurm`
- Loops over multiple files inside one Python job (see the sketch below).
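The heart of the batch script is just a loop over the inputs. Here is a minimal sketch, assuming a hypothetical `process_file()` helper importable from the single-file script and inputs stored as CSVs under `data/`:

```python
import json
from pathlib import Path

# Hypothetical import: process_file() stands in for the real extraction logic.
from extract_form_3_one_file import process_file

def main():
    Path("results").mkdir(exist_ok=True)
    for path in sorted(Path("data").glob("*.csv")):
        result = process_file(path)                       # parse one filing
        out_path = Path("results") / f"{path.stem}.json"  # one output per input
        out_path.write_text(json.dumps(result))
        print(f"Processed {path.name}", flush=True)

if __name__ == "__main__":
    main()
```

Note that if this job dies on file 600 of 1,000, a plain loop like this starts over from file 1 on the next run, which motivates the next section.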
Exercise:
- Discuss what happens if the job fails partway through.
Checkpointing for Fault Tolerance
- Script: `extract_form_3_batch_checkpoint.py`
- Slurm job: `extract_form_3_batch_checkpoint.slurm`
- Adds checkpointing logic: tracks completed files and resumes on failure (see the sketch after this list).
- A common pattern for long-running research jobs.
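One simple way to implement this, shown in the sketch below, is to treat each output file as its own checkpoint and skip any input whose output already exists. This may differ from how `extract_form_3_batch_checkpoint.py` actually does it:

```python
import json
from pathlib import Path

# Hypothetical import, as in the batch sketch above.
from extract_form_3_one_file import process_file

def main():
    Path("results").mkdir(exist_ok=True)
    for path in sorted(Path("data").glob("*.csv")):
        out_path = Path("results") / f"{path.stem}.json"
        if out_path.exists():
            # Checkpoint hit: this input was finished in an earlier run.
            print(f"Skipping {path.name} (already done)", flush=True)
            continue
        result = process_file(path)
        # Write to a temp file and rename, so a job killed mid-write never
        # leaves a half-finished output that looks like a valid checkpoint.
        tmp_path = out_path.with_name(out_path.name + ".tmp")
        tmp_path.write_text(json.dumps(result))
        tmp_path.rename(out_path)
        print(f"Processed {path.name}", flush=True)

if __name__ == "__main__":
    main()
```

The write-then-rename step matters because a rename is atomic on the same filesystem: the checkpoint either exists completely or not at all.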
Exercise:
- Simulate a failure by forcing the job to stop after a few files.
- Restart the job and verify it skips already completed files.
Parallel Processing with Slurm Job Arrays
- Script: `extract_form_3_onefile_array.py`
- Slurm job: `extract_form_3_array.slurm` (sketched below)
- Uses Slurm job arrays to process files independently and in parallel.
- Highly efficient for large datasets.
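The Slurm half of the pattern is the `--array` directive: Slurm launches one task per index, and each task reads its own `SLURM_ARRAY_TASK_ID` to decide which file to process. A minimal sketch follows; the array range, resources, and argument convention are assumptions, so check the actual `extract_form_3_array.slurm`:

```bash
#!/bin/bash
#SBATCH --job-name=extract-form-3
#SBATCH --output=slurm/logs/extract-form-3-%A_%a.out  # %A = job ID, %a = task index
#SBATCH --array=0-9                                   # assumed: 10 input files
#SBATCH --time=00:10:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

source venv/bin/activate
# Each task passes its own index; the Python script can map the index to an
# input file, e.g. the Nth entry of a sorted listing of data/.
python scripts/extract_form_3_onefile_array.py "$SLURM_ARRAY_TASK_ID"
```

This `%A_%a` naming matches the log in the example tree, `extract-form-3-758549_0.out`: job 758549, task 0.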
Exercise:
- Discuss how to aggregate results.
- Discuss the limitations of job arrays on Yen-Slurm.
- Discuss the limitations of the file system on the Yens.
Copy Results and Document Your Work
Exercise:
- Use `scp` to copy your `results` directory back to your laptop (example below).
- Write a short `README.md` describing:
  - What the pipeline does
  - How it runs (Slurm + Python)
  - Where the results go
  - How to rerun it with new data
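From a terminal on your laptop (not on the Yens), the copy might look like this; the username, host, and paths are placeholders to adapt:

```bash
# Recursively copy the results directory from the cluster to the current
# local directory. Replace the username, host, and remote path with yours.
scp -r your_username@yen.stanford.edu:~/exercises/results .
```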