Uploader

Data upload is the first step in the analysis. The Uploader is used to upload different data types to the platform.

Overview
UPLOAD SEQUENCE DATA
UPLOAD REFERENCES
UPLOAD ANNOTATIONS
UPLOAD METADATA

Overview

An analysis generally requires four data types:

1. Sequencing data (read files)
2. Reference genomes
3. Annotation files
4. Metadata

Click the icon in the Utilities menu to open the Uploader window. Each data type is uploaded from its own tab in three high-level steps:

1. Define the data type with the correct tags.
2. Verify the files: check their format, whether they already exist in the user's account, and whether their names conflict.
3. Upload.

Pre-configured references and annotation files are available for the model organisms.

Fig. Data upload interface. Each tab uploads a different data type to the data store. A: Sequence data; B: References; C: Annotations; D: Metadata.

UPLOAD SEQUENCE DATA

Sequencing files can be uploaded through the SEQUENCE DATA (Fig. 1) tab. Complete all the fields in the form before selecting the files.

MatePair: Select appropriate mate-pair information, if available. This is an optional parameter (Default: No).
Strand: Select appropriate strand information, if available. This is an optional parameter (Default: No).
Sequencing Platform*: Select the sequencing instrument used to generate the data (check with the provider).
Organism*: Select the organism from the drop-down menu.
- HINT - Select Other if the organism is not listed.
Tag*: Provide a tag name for the sample files that can be used to filter the data later.

Caution - Tags should contain alphanumeric characters only.
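The tag rule above can be checked before opening the upload form. The following is a minimal sketch, assuming that "alphanumeric" means ASCII letters and digits only; the function name is hypothetical and is not part of the platform.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits, with no spaces,
# underscores, or hyphens. The platform's exact rule may differ.
_TAG_PATTERN = re.compile(r"[A-Za-z0-9]+")


def is_valid_tag(tag: str) -> bool:
    """Return True if the tag is non-empty and purely alphanumeric."""
    return _TAG_PATTERN.fullmatch(tag) is not None


if __name__ == "__main__":
    for candidate in ["Batch01", "batch_01", "run 2", ""]:
        print(repr(candidate), is_valid_tag(candidate))
```

Under this assumption, a tag such as Batch01 passes, while batch_01 and "run 2" are rejected.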

The user can select sample files through the selection dialog box.

HINT - Currently allowed data file formats are fq, fastq, bam, sam, ubam, cram, hdf5, and their zipped versions.

Ensure that each sample file is compressed separately and that the compressed file keeps the name of the sample file.
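File names can be screened locally against the allowed formats before upload. This is only a sketch: the list of formats comes from the hint above, but the set of compression suffixes (gz, zip, bz2) is an assumption, since the platform does not list which "zipped versions" it accepts. The function name is hypothetical.

```python
from pathlib import Path

# Formats listed in the hint above.
ALLOWED_FORMATS = {"fq", "fastq", "bam", "sam", "ubam", "cram", "hdf5"}

# Assumption: these are the compression suffixes the platform accepts.
COMPRESSED_SUFFIXES = {"gz", "zip", "bz2"}


def is_allowed_sequence_file(filename: str) -> bool:
    """Check the file extension, ignoring one trailing compression suffix."""
    suffixes = [s.lstrip(".").lower() for s in Path(filename).suffixes]
    if suffixes and suffixes[-1] in COMPRESSED_SUFFIXES:
        suffixes = suffixes[:-1]  # drop ".gz" and similar
    return bool(suffixes) and suffixes[-1] in ALLOWED_FORMATS


if __name__ == "__main__":
    for name in ["sample_1.fastq.gz", "sample.bam", "reads.txt", "reads.gz"]:
        print(name, is_allowed_sequence_file(name))
```

A file such as sample_1.fastq.gz also follows the naming rule above, because the compressed file keeps the name of the sample file it contains.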

HINT - Sequencing data from other sources (NCBI SRA) should be converted to fastq/fq format.

HINT - If sequencing is multiplexed, demultiplex the data before uploading.

Click to verify file types, check file existence in the user's account, and pair files (not applicable to Single-End data). File pairing information (i.e. forward/reverse sample) is retained for PE sample data.

Pairing is done if forward and reverse samples have the suffix combinations: _1/_2; _F/_R; _f/_r; -1/-2; -F/-R; -f/-r; _R1/_R2.
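The suffix rule above can be illustrated with a short script that predicts which files will be paired. This is a sketch of the naming convention only, not the platform's own implementation. It assumes that the sample name is everything before the first dot in the file name, and the function names are hypothetical.

```python
# Forward/reverse suffix combinations listed above.
SUFFIX_PAIRS = [
    ("_R1", "_R2"),
    ("_1", "_2"), ("_F", "_R"), ("_f", "_r"),
    ("-1", "-2"), ("-F", "-R"), ("-f", "-r"),
]


def sample_stem(filename: str) -> str:
    """Drop everything from the first dot on, e.g. 'a_1.fastq.gz' -> 'a_1'.

    Assumption: sample names themselves contain no dots.
    """
    return filename.split(".", 1)[0]


def pair_files(filenames):
    """Return (pairs, unpaired); pairs is a list of (forward, reverse) names."""
    by_stem = {sample_stem(name): name for name in filenames}
    pairs, used = [], set()
    for stem, name in sorted(by_stem.items()):
        if name in used:
            continue
        for fwd, rev in SUFFIX_PAIRS:
            if not stem.endswith(fwd):
                continue
            # Look for the mate with the matching reverse suffix.
            mate = by_stem.get(stem[: -len(fwd)] + rev)
            if mate is not None and mate not in used:
                pairs.append((name, mate))
                used.update({name, mate})
                break
    unpaired = sorted(name for name in filenames if name not in used)
    return pairs, unpaired


if __name__ == "__main__":
    files = ["s1_R1.fastq.gz", "s1_R2.fastq.gz", "s2-F.fq", "s2-R.fq", "solo.fastq"]
    print(pair_files(files))
```

Note that mixed combinations, such as _1 with -2, are not paired in this sketch, because only the listed combinations are treated as mates.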

Upon confirming the Data Store check, click to begin the file transfer (the progress bar indicates the upload status for each file). After a successful upload, the following four actions are performed (errors from any failed action are displayed):
- Uncompression of the files
- Data quality check with FastQC
- Registration into SEQUENCE DATA
- Status notification through email
Go to SEQUENCE DATA to access the uploaded files and quality report (Fig. 2). If the files are not visible immediately, refresh the window.
Click on a sample name to access its additional details.
Use the icon in the upper-right corner of the sample details window to delete samples.

UPLOAD REFERENCES

The latest versions of reference genomes and transcriptomes are available on the platform for eight model organisms. Pre-configured references can be explored through REFERENCES (Fig.). The owner column distinguishes the pre-configured references (owned by Stanome) from custom references (owned by users).

Click the REFERENCES tab on the Upload window to upload genome and transcriptome files.

Fig. 1. REFERENCE FILES upload window. Fields shown in red are mandatory.

Complete all the fields in the form before selecting the files.

Add to Existing: Adds further files, such as a transcriptome or genome, to an existing genome/transcriptome. This is an optional parameter (Default: No).
Organism*: Select the organism name.
Version*: Provide the genome build (version) name or number of the reference file(s). This helps to select the correct reference version during analysis.

HINT - Allowed formats for genome or transcriptome files are fa, fasta, fna, and their compressed formats. Each file should be compressed separately.

The following actions are performed on the uploaded references:

- Uncompression of the files
- Validation of the format and integrity of the file
- Registration into REFERENCES
- Status notification through email

Successfully uploaded reference files are stored in REFERENCES. Custom references can be deleted using the icon on the reference details window.

UPLOAD ANNOTATIONS

Pathways, gene ontology (GO) terms, ABR genes, and VEP (Variant Effect Predictor) files are classified as annotations and can be uploaded through the ANNOTATIONS (Fig.) tab.

ABR and GO_OBO uploads have two fields:

Organism: Name of the organism
Tag: The version number of the annotation file

Gene Model, VEP, and Variations uploads have three fields:

Organism: Name of the organism
Reference version: Version of the reference file
Tag: Version of the annotation file

GO and Pathway uploads have four fields:

Organism: Name of the organism
Reference version: Version of the reference file
GTF version: Version of the gene model file
Tag: Version of the annotation file
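The field lists above can be summarized as a lookup table, which is handy when preparing many annotation uploads. The sketch below assumes that every listed field must be filled in before upload. The dictionary keys and the function name are hypothetical labels chosen for illustration, not identifiers used by the platform.

```python
# Fields per annotation type, as listed above.
# Assumption: all listed fields must be filled in before upload.
REQUIRED_FIELDS = {
    "ABR": ("organism", "tag"),
    "GO_OBO": ("organism", "tag"),
    "Gene Model": ("organism", "reference_version", "tag"),
    "VEP": ("organism", "reference_version", "tag"),
    "Variations": ("organism", "reference_version", "tag"),
    "GO": ("organism", "reference_version", "gtf_version", "tag"),
    "Pathway": ("organism", "reference_version", "gtf_version", "tag"),
}


def missing_fields(annotation_type: str, form: dict) -> list:
    """Return the fields that are absent or blank for this annotation type."""
    required = REQUIRED_FIELDS[annotation_type]
    return [name for name in required if not str(form.get(name, "")).strip()]


if __name__ == "__main__":
    form = {"organism": "Homo sapiens", "tag": "v1"}
    print(missing_fields("ABR", form))      # nothing missing
    print(missing_fields("Pathway", form))  # reference and GTF versions missing
```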

The annotation types and their allowed formats are listed in the table below.

Tag	Data Type	Format	Details
Gene models	Gene annotations	GTF, GFF3	Source: Ensembl. Should correspond to the Ensembl genome versions
Pathway	Pathways	Tab, GMT	Source: WikiPathways
Gene Ontology	Gene Ontology associations	OBO, Tab, TXT	GO terms
VEP	Variant annotations	Custom	Source: Ensembl. Should correspond to the Ensembl genome versions
Variations	GATK variants	VCF	Source: Ensembl

Table. Annotation file formats. Compressed files are allowed as long as each file is compressed separately.

Fig. 1. ANNOTATION FILES upload window. Select one of the radio buttons to upload the data.

As with sequence and reference file uploads, the following actions are performed on the uploaded annotation files:

- Uncompression of the files
- Validation of the format and integrity of the file and its compatibility with the genome/gene annotation file
- Registration into ANNOTATIONS
- Status notification through email

Successfully uploaded files are stored in ANNOTATIONS and can be accessed while executing the pipelines.

UPLOAD METADATA

Any files associated with an experiment (excluding sequencing files, references, and annotations) can be uploaded through the METADATA (Fig.) tab. During upload, metadata files should always be associated with an organism and tagged appropriately, as listed in the table below.

Data Type	Tag Name	Format	Details
List of genes	Gene List	Tab, CSV, TXT	Ensembl Gene IDs only
Target markers	Hotspots	BED, VCF	SNPs, MNPs, INDELs
Amplicon ranges	Amplicon Range	BED, VCF	Target region with start and ends
Variants/genotypes	Genotypes	BED, VCF	Called variants

Table. Metadata file formats and associated upload tags.
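Because gene lists accept Ensembl Gene IDs only, it can help to screen a list before uploading it. The following sketch assumes a file with one ID per line and the standard Ensembl stable ID layout (ENS, an optional species prefix, G, eleven digits, and an optional version suffix). Organisms whose Ensembl gene IDs follow another scheme would need a different pattern. The function name is hypothetical.

```python
import re

# Assumption: standard Ensembl stable gene IDs, e.g. ENSG00000139618 (human)
# or ENSMUSG00000017167 (mouse), optionally followed by a version such as ".3".
ENSEMBL_GENE_ID = re.compile(r"ENS[A-Z]*G\d{11}(\.\d+)?")


def invalid_gene_ids(lines):
    """Return entries that do not look like Ensembl gene IDs (blank lines skipped)."""
    bad = []
    for line in lines:
        gene_id = line.strip()
        if gene_id and not ENSEMBL_GENE_ID.fullmatch(gene_id):
            bad.append(gene_id)
    return bad


if __name__ == "__main__":
    entries = ["ENSG00000139618", "ENSMUSG00000017167", "BRCA2", "ENST00000380152", ""]
    print(invalid_gene_ids(entries))
```

In this example, the gene symbol BRCA2 and the transcript ID ENST00000380152 are reported as invalid, while the two gene IDs pass.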

As with sequence and reference file uploads, the following actions are performed on the uploaded metadata files:

- Uncompression of the files
- Validation of the format and integrity of the file and its compatibility with the reference genome
- Registration into METADATA
- Status notification through email

Fig. 1. METADATA FILES upload window.

Successfully uploaded metadata files are stored in METADATA and can be accessed while executing the pipelines.