Data Management
Head Node
The DELL Ubuntu PC (head node) is primarily used for running genomic data pipelines and workflows, as well as most interactive genomic data analyses.
These pipelines and workflows generate many large temporary files that can consume a disproportionate amount of storage if they are not cleaned up regularly.
Storage is split across 3 individual partitions of 3.5 TB each, and every partition is shared among all users of the DELL Ubuntu PC, so one user's heavy usage reduces the space available to everyone else.
Home Directories
- /home
- 3.5 TB shared by all users. If one user stores 1 TB of data in their home directory, that 1 TB is no longer available to anyone else.
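A quick way to see who is using the shared space is sketched below. It is a minimal sketch, not an official lab tool: the function names are hypothetical, and it reports apparent file sizes, which can differ slightly from what `du` reports on disk.

```python
import os
import shutil

TB = 10**12  # decimal terabyte, in bytes


def partition_usage_tb(path):
    """Return (total, used, free) in TB for the partition that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.total / TB, usage.used / TB, usage.free / TB


def directory_size_bytes(path):
    """Sum the sizes of all regular files under `path`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # File vanished or is unreadable; skip it rather than abort.
                continue
    return total


def home_usage_by_user(home_root):
    """Map each top-level directory under `home_root` to its size in bytes."""
    sizes = {}
    with os.scandir(home_root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes[entry.name] = directory_size_bytes(entry.path)
    return sizes
```

For example, `home_usage_by_user("/home")` returns one entry per home directory, and `partition_usage_tb("/home")` shows how much of the 3.5 TB is still free. Reading other users' home directories may require elevated permissions.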
Pipeline and Workflow “working” Disk Partitions
- Two directories are each mapped to their own disk partition
- Each partition has 3.5 TB shared by all users
Using the following ‘temporary’ directories will allow us to easily clean old files from prior pipeline executions.
/mnt/bioinformatics
└── nextflow_tmp
    ├── alessiogalimi
    ├── francescalastname
    ├── jennysmith
    ├── marikaguercio
    └── valentina
/mnt/bioinformatics_datasets
└── nextflow_tmp
    ├── alessiogalimi
    ├── francescalastname
    ├── jennysmith
    ├── marikaguercio
    └── valentina
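Cleaning these temporary directories could start with a dry-run listing such as the sketch below. It only reports files and never deletes anything. The function names and the age threshold are assumptions made for illustration; the lab has not stated a retention period here.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_old_files(root, max_age_days, now=None):
    """List (path, size_in_bytes) for regular files under `root` whose
    modification time is more than `max_age_days` days before `now`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    old = []
    for path in Path(root).rglob("*"):
        # Ignore symlinks and anything that is not a regular file.
        if path.is_symlink() or not path.is_file():
            continue
        info = path.stat()
        if info.st_mtime < cutoff:
            old.append((path, info.st_size))
    return sorted(old)


def summarize(old_files):
    """Return a one-line summary of how much space the listed files occupy."""
    total_bytes = sum(size for _path, size in old_files)
    return f"{len(old_files)} files, {total_bytes / 10**9:.2f} GB"
```

For example, `summarize(find_old_files("/mnt/bioinformatics/nextflow_tmp/jennysmith", 30))` would report how much space that user's files older than 30 days occupy (30 is a hypothetical threshold). Reviewing the list before removing anything keeps the cleanup safe.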
Network Drives
The network drives are used for data backups and storage.
Define Data and File Locations
The base directory for the metadata, raw data, processed data, and analysis outputs is:
- /mnt/network_drives/storage-bioinfo_labq/
This file path can be accessed from the DELL Ubuntu PC (head node) in the office.
The directory structure is simple, but there is a “data flow” or “data life cycle” that we will need to manage. The main directory structure is below:
0001_ngs_raw_data/
0002_ngs_metadata/
0003_ngs_processed_data/
0004_ngs_analysis/
We can use the following naming convention and file hierarchy to help organize datasets for the long term.
For example:
000X_ngs_XXXXX
└── pi_lastname_first_initial
    └── collaborator_lastname_first_initial
        └── [PROJECT_NAME]
            └── [YEAR-MONTH-DAY]_[DATA-TYPE]_[SHORT-DESCRIPTION]
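The naming scheme above can be turned into a small path builder so that every script produces the same layout. This is a minimal sketch: `dataset_dir`, `STAGE_DIRS`, and the rule that rejects spaces and slashes in names are assumptions introduced for illustration, not an existing lab utility.

```python
from datetime import date
from pathlib import PurePosixPath

BASE_DIR = PurePosixPath("/mnt/network_drives/storage-bioinfo_labq")

# Short keys (hypothetical) for the four top-level directories.
STAGE_DIRS = {
    "raw": "0001_ngs_raw_data",
    "metadata": "0002_ngs_metadata",
    "processed": "0003_ngs_processed_data",
    "analysis": "0004_ngs_analysis",
}


def dataset_dir(stage, pi, collaborator, project, run_date, data_type, description):
    """Build <base>/<stage>/<pi>/<collaborator>/<project>/<date>_<type>_<description>."""
    parts = [pi, collaborator, project, data_type, description]
    for part in parts:
        # Assumed rule: keep names shell-friendly (no spaces or slashes).
        if not part or " " in part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    leaf = f"{run_date.isoformat()}_{data_type}_{description}"
    return BASE_DIR / STAGE_DIRS[stage] / pi / collaborator / project / leaf


if __name__ == "__main__":
    print(dataset_dir("analysis", "quintarelli_c", "lastname_f", "DEMO",
                      date(2025, 1, 1), "RNAseq_Bulk", "DEMO"))
```

Using `date.isoformat()` gives the `YEAR-MONTH-DAY` prefix with zero padding, so directory names sort chronologically.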
Data Life Cycle
STEP 1 RAW DATA DUMP
Place the raw output files from BaseSpace, Nanopore, etc. here.
# for example a project called 'T-ALL'
0001_ngs_raw_data
└── quintarelli_c
    └── lastname_f
        └── T-ALL
            └── WTS-XXXX
                └── FB_2025_00002321314
# for example a project called 'CAR-GD2_mbIL15'
0001_ngs_raw_data
└── quintarelli_c
    └── guercio_m
        └── CAR-GD2_mbIL15
            └── BaseSPACE-XXXXX/
STEP 2 INITIAL DATA CLEANING / ANALYSIS DIR
Think of this as a bottom-up strategy, in which the analysis directory feeds the metadata and processed-data directories:
0004_ngs_analysis --> 0002_ngs_metadata
0004_ngs_analysis --> 0003_ngs_processed_data
This is because, once the raw data are generated, the next step is to process and analyze those raw inputs.
EXAMPLE with the ‘DEMO’ project:
0004_ngs_analysis/
└── quintarelli_c/
    └── lastname_f/
        └── DEMO/
            ├── 2025-01-01_RNAseq_Bulk_DEMO
            │   └── save cleaned metadata
            │       └── '0002_ngs_metadata/quintarelli_c/lastname_f/DEMO/2025-01-01_DEMO_RNAseq_Bulk_sample_annotations.csv'
            ├── 2025-02-01_RNAseq_Bulk_DEMO_quant
            │   ├── read in metadata
            │   │   └── '0002_ngs_metadata/quintarelli_c/lastname_f/DEMO/2025-01-01_DEMO_RNAseq_Bulk_sample_annotations.csv'
            │   └── save **FINAL** RNA-seq counts
            │       └── '0003_ngs_processed_data/quintarelli_c/lastname_f/DEMO/2025-02-01_DEMO_RNAseq_Bulk_raw_counts.csv'
            └── 2025-03-01_WGS_Nanopore_DEMO
                ├── read in metadata
                │   └── '0002_ngs_metadata/quintarelli_c/lastname_f/DEMO/[MOST-CURRENT-DATE]_DEMO_RNAseq_Bulk_sample_annotations.csv'
                ├── read in cleaned RNA-seq counts
                │   └── '0003_ngs_processed_data/quintarelli_c/lastname_f/DEMO/2025-02-01_DEMO_RNAseq_Bulk_raw_counts.csv'
                └── save **FINAL** variant calls
                    ├── '0003_ngs_processed_data/quintarelli_c/lastname_f/DEMO/2025-03-01_WGS_Nanopore_Sample1_Variants.vcf'
                    ├── '0003_ngs_processed_data/quintarelli_c/lastname_f/DEMO/2025-03-01_WGS_Nanopore_Sample2_Variants.vcf'
                    └── '0003_ngs_processed_data/quintarelli_c/lastname_f/DEMO/2025-03-01_WGS_Nanopore_Full_Cohort_Variants.sqlite'
STEP 3 METADATA VERSIONS
Whenever possible, update the metadata in the same analysis directory that first produced it:
0004_ngs_analysis/
└── quintarelli_c/
    └── lastname_f/
        └── DEMO/
            └── 2025-01-01_RNAseq_Bulk_DEMO
                ├── 1st update of cleaned metadata
                │   └── '2025-05-01_DEMO_RNAseq_Bulk_sample_annotations.csv'
                └── 2nd update of cleaned metadata
                    └── '2025-10-01_DEMO_RNAseq_Bulk_sample_annotations.csv'
Then save each updated version in the project's 0002_ngs_metadata directory, alongside the earlier dated versions:
0002_ngs_metadata/
└── quintarelli_c/
    └── lastname_f/
        └── DEMO/
            ├── 2025-01-01_DEMO_RNAseq_Bulk_sample_annotations.csv
            ├── 2025-05-01_DEMO_RNAseq_Bulk_sample_annotations.csv
            └── 2025-10-01_DEMO_RNAseq_Bulk_sample_annotations.csv
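Because each metadata version carries its date as a file-name prefix, an analysis can pick the `[MOST-CURRENT-DATE]` version automatically. The sketch below shows one way to do that; `latest_dated_file` is a hypothetical helper, and it assumes every version follows the `YYYY-MM-DD_` prefix convention.

```python
import re
from datetime import date
from pathlib import Path

# Matches the leading YYYY-MM-DD_ prefix used in the file names above.
DATE_PREFIX = re.compile(r"^(\d{4})-(\d{2})-(\d{2})_")


def latest_dated_file(directory, suffix):
    """Return the file in `directory` ending with `suffix` that has the most
    recent date prefix, or None if no file matches."""
    best = None
    for path in Path(directory).iterdir():
        match = DATE_PREFIX.match(path.name)
        if match is None or not path.name.endswith(suffix):
            continue
        try:
            stamp = date(int(match[1]), int(match[2]), int(match[3]))
        except ValueError:
            # Prefix looks like a date but is not a valid one (e.g. month 13).
            continue
        if best is None or stamp > best[0]:
            best = (stamp, path)
    return None if best is None else best[1]
```

With the three files listed above, `latest_dated_file(project_dir, "_DEMO_RNAseq_Bulk_sample_annotations.csv")` would select the `2025-10-01` version, where `project_dir` is the project's metadata directory. Parsing the date, rather than sorting names as text, also rejects malformed prefixes.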
STEP 4 FINAL INPUTS TO DOWNSTREAM DATA ANALYSIS
Similarly, save the FINAL publication-ready versions of genomic data analyses, such as gene quantification matrices, VCF files, and RNA-seq fusion outputs, in the 0003_ngs_processed_data directory.
0003_ngs_processed_data/
└── quintarelli_c/
    └── lastname_f/
        └── DEMO/
            ├── 2025-02-01_DEMO_RNAseq_Bulk_raw_counts.csv
            ├── 2025-03-01_WGS_Nanopore_Sample1_Variants.vcf
            └── 2025-03-01_WGS_Nanopore_Full_Cohort_Variants.sqlite
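Copying a FINAL file from an analysis directory into the processed-data directory can be scripted so that an existing file is never silently replaced. This is only a sketch: `publish_final` is a hypothetical helper, and the no-overwrite rule is an assumption rather than a stated lab policy.

```python
import shutil
from pathlib import Path


def publish_final(source, processed_dir):
    """Copy `source` into `processed_dir`, refusing to overwrite an existing file."""
    source = Path(source)
    dest_dir = Path(processed_dir)
    # Create the PI/collaborator/project directories if they do not exist yet.
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / source.name
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; save a new dated version instead")
    # copy2 also preserves timestamps, which helps when auditing later.
    shutil.copy2(source, dest)
    return dest
```

Refusing to overwrite fits the dated-file convention: a corrected result gets a new date prefix instead of replacing the earlier file.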