Title

Documentation

General

The Ruminant Genome Database (RGD) is a comprehensive repository of integrated genomics and expression data of ruminantia, a suborder which includes economically important livestock cow, water buffalo, yak, sheep and goat. The RGD built a feature-rich UCSC Genome Browser to the local server, and implements open-source software and MySQL databases providing access to all the data contributed by our Ruminant Genome Project (RGP) and other public database. Today, RGD contains data of 52 ruminant genomes, gene and repeat annotation, whole genome alignments with 52 ruminants and 13 mammals, 6 scales conservative evaluation, 361 runs of RNA-seq data, as well as the biological knowledge of each ruminant family. Using the best assembled goat (ARS1) as the reference genome, the comparative genomic data can be viewed and downloaded by the local UCSC Genome Browser. The GO term, KEGG pathway, and gene expression heatmap can be queried with gene symbol. RGD also provides online alignment tools such as webBlat and NCBI’s wwwBLAST server for goat (ARS1) assembly. RGD will provide good data visualization for researchers to investigate evolutionary, ecological and economic traits issues.

Methods

1. Cross-species Annotation

Firstly, we aligned goat whole genome (GCA_001704415.1) to the sheep (GCA_000298735.2) or cattle genome (GCA_000003055.3 and GCA_000003055.5) producing pairwise alignments by LAST v88555. Then gene positions of sheep or cattle were converted to goat genome using LiftOver.

step 1: lastdb -uNEAR -cR11 ref.fa.db ref.fa
step 2: lastal -P20 -m100 -E0.05 ref.fa.db query.fa | last-split -m1 > query.maf
step 3: maf-swap query.maf | last-split | maf-swap | last-split | maf-sort > query.LAST.maf
step 4: maf-convert psl query.LAST.maf > query.psl
step 5: axtChain -linearGap=loose -psl query.psl ref.2bit query.2bit queryToRef.chain
step 6: liftOver -minMatch=0.9 -genePred ref.gene.genePred queryToRef.chain query.gene.map query.gene.unMapped
2. 65-way Multiz Align

We built whole-genome alignments for 65 species (52 ruminants and 13 mammals) with LAST v88555 and MULTIZ, using the goat ARS1 assembly as the reference.

step 1: lastdb -uNEAR -cR11 ref.fa.db ref.fa
step 2: lastal -P20 -m100 -E0.05 ref.fa.db query.fa | last-split -m1 > query.maf
step 3: maf-swap query.maf | last-split | maf-swap | last-split | maf-sort > query.LAST.maf
step 4: perl maf.rename.species.S.pl query.LAST.maf refSpeciesName querySpeciesName query.Final.maf > query.stat
step 5: multiz query1.Final.maf query2.Final.maf 0 all > query1.query2.mutiz.maf 

Get perl script: maf.rename.species.S.pl

3. Conservation scores

Both phastCons and phyloP conservation scores have been computed separately for four groups of organisms: 8 Cervidae alone, 37 Bovidae, 48 Pecora and 51 Ruminatia.

[generate non-conservative and conservative models]

step 1: phyloFit --tree tree.nwk --msa-format FASTA --out-root nonconserved-4d.mod 4DTV.fasta
step 2: msa_split chr.maf --in-format MAF --windows 100000,0 --out-root split_maf/chr --out-format SS --min-informative 1000 --between-blocks 5000
step 3: phastCons --estimate-rho window.ss --no-post-probs window.ss nonconserved-4d.mod.mod
step 4: phyloBoot --read-mods window.cons.list --output-average Average.chr.cons.mod
        window.cons.list: 6.10000001-10100000.ss.cons.mod,6.1000001-1100000.ss.cons.mod,6.100001-200000.ss.cons.mod,6.100099918-100200000.ss.cons.mod,6.100200001-100300000.ss.cons.mod ...

[ phastCons ]

phastCons --most-conserved phastCons.chr.bed --score chr.maf Average.chr.cons.mod,nonconserved-4d.mod.mod > phastCons.chr.wig

[ phyloP ]

phyloP --wig-scores --mode CONACC --method LRT nonconserved-4d.mod.mod chr.maf > phyloP.chr.wig
4. RNA-seq

Pipeline: RNA-Seq data with a reference genome

step 1: [Quality_control]
        java -Xmx30g -jar trimmomatic-0.36.jar PE -threads 10 sample_1.fq.gz sample_2.fq.gz sample_1.clean.fq.gz sample_1.unpaired.fq.gz sample_2.clean.fq.gz sample_2.unpaired.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40 TOPHRED33 > sample.log
step 2: [STAR_index]
        STAR --runMode genomeGenerate --genomeDir index --genomeFastaFiles ref.fa --runThreadN 24 --sjdbGTFfile ref.gtf --sjdbOverhang 149
step 3: [Hisat2_index]
        hisat2-build ref.fa ref.hisat2.index
step 4: [Splice_sites]
        python3.5 hisat2_extract_splice_sites.py ref.gtf > ref.hisat.splicing.site
step 5: [Mapping]
        STAR --genomeDir $INDEX --readFilesIn $IN/${i}_1.clean.fq.gz $IN/${i}_2.clean.fq.gz --readFilesCommand zcat --runThreadN 8 --outFilterMultimapNmax 1 --outFilterIntronMotifs RemoveNoncanonical Unannotated --outFilterMismatchNmax 10 --outSAMstrandField intronMotif --outSJfilterReads Unique --outSAMtype BAM Unsorted --outReadsUnmapped Fastx --outFileNamePrefix $OUT/${i}
        hisat2 --dta-cufflinks --no-mixed --no-discordant -p 10 --known-splicesite-infile ref.hisat.splicing.site -x ref.hisat2.index -1 $OUT/${i}Unmapped.out.mate1 -2 $OUT/${i}Unmapped.out.mate2 -S $OUT/${i}.sam
        java -Djava.io.tmpdir=$TMP -jar picard.jar CleanSam I=$OUT/${i}Aligned.out.bam O=$OUT/${i}.STAR.bam
        java -Djava.io.tmpdir=$TMP -jar picard.jar CleanSam I=$OUT/${i}.sam O=$OUT/${i}.hisat.bam
        java -Djava.io.tmpdir=$TMP -jar picard.jar MergeSamFiles I=$OUT/${i}.STAR.bam I=$OUT/${i}.hisat.bam SORT_ORDER=coordinate O=$BAM/${i}.bam
step 6: [Assemble_transcripts]
        stringtie sample.bam -G ref.gff -c 3 -j 2 -f 0.05 -o sample.gtf -p 16
        stringtie --merge -G ref.gff -o merge.gff allSample.gtf.list -g 100 -m 200 -c 0.8
        stringtie -e -B -p 8 -G merge.gff -o sample.gtf sample.bam
step 7: FPKMs (Fragments Per Kilobase per Million mapped reads) of these transcripts and genes in each sample were computed by Ballgown (Version 2.2.0)

Pipeline: RNA-Seq data without a reference genome

step 1: [de novo transcriptome assembly]
        sample_1.clean.fq.gz | head -1000000 > 00_fastQC/QC_result/sample_1.clean.test.fq; fastqc 00_fastQC/QC_result/sample_1.clean.test.fq -extract -o 00_fastQC/QC_result (see per_base_quality.png, to define -ta -tb -tc -td in next step)
        perl get_clean_shell.pl -q 33 -t 8 -s insert_size.txt -ta 2 -tb 10 -tc 2 -td 10 -clean -i infile.txt -o ./clean_data_rest (generate sample_1.clean.fq.clean)
        perl Bridger.pl --seqType fq --left sample_1.clean.fq.clean --right sample_2.clean.fq.clean --CPU 20 --output bridger_jg
step 2: [CD-HIT: removing redundant or highly similar sequences]
        cd-hit-est -i ./7_Bridger.fasta -o ./7_Bridger.nr.fasta -M 10000 -c 0.9 -n 8 -T 20
step 3: [calculate expression]
        trinityrnaseq-Trinity-v2.4.0/align_and_estimate_abundance.pl --seqType fq --left sample_1.clean.fq.gz --right sample_2.clean.fq.gz --transcripts 7_Bridger.fasta --output_prefix EG --est_method RSEM  --aln_method bowtie --output_dir ./EG.RSEM   (generate SD.isoforms.results)
        abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix species_trans \
        trinity/RSEM/BW.RSEM/BW.isoforms.results \
        trinity/RSEM/EG.RSEM/EG.isoforms.results \
        trinity/RSEM/F.RSEM/F.isoforms.results \
        trinity/RSEM/G.RSEM/G.isoforms.results \
        trinity/RSEM/GW.RSEM/GW.isoforms.results
step 4: [annotation by diamond blastx]
        diamond blastx --query 7_Bridger.fasta --db goat.protein --outfmt 6 --threads 20 --out species.fasta.blastx.goat.result --evalue 1e-5 --max-target-seqs 1

Manual

1. Phylogenetic tree

We used whole genome alignments of 52 Ruminantia genomes to construct high resolution phylogenetic tree. Four important species diversion events are coincided with global climate changes from 65 Mya, including the first emerged ruminant species, the radiational diversion of Pecora, the radiational diversion of subfamilies and the diversion of genera. Other datasets, such as fourfold degenerate sites (4dTv), coding sequence of genes (CDS), highly conserved elements (HCE), and mitochondrial complete genomes also support this robust phylogenic tree of ruminants.

Fig 1. The most consensus phylogenetic tree of Ruminantia.

The branches of six main families are marked in different colors.

2. Families

User can get biological knowledge of six ruminant families from RGD, including distribution, morphology, and niche.

3. Basic search

3.1 Gene Basic Information and GO/KO Annotation

Users can search with a gene symbol and select one of the above species to get results from three aspects:

(1) "Gene basic information"
    containing Chrom, Start, End, Strand, CDS Length, Exon Number, Transcript ID, Gene Biotype and Transcript
    Product. When selected species is goat, user can click button of Gbrowse to jumping the local UCSC genome
    browser. Additionally, NCBI Gene link is also provided near the title to help get more information such as
    gene sequence, mRNA sequence and protein sequence.
(2) "Gene Ontology"
    This aspects provides relevant GO ID and GO terms which can be linked to AmiGO 2 database.
    The Gene Ontology is only provided for cattle.
(3) "KEGG pathway"
    In addition to mouflon, the remaining six species (goat, sheep, cattle, wild yak, zebu and chiru) provide
    gene pathways which is linked to KEGG database. User can get KO, pathway, motif, protein sequence, nucleic
    acids sequence and etc through this KEGG link.

3.2 Gene Expression Heatmap

Users can input the gene symbol in the search box and choose one species getting a heatmap with different tissues.

(1) Goat [87 samples]:
    10 tissues at different stages are displayed, especially for rumen at 10 time-points from birth to six month.
(2) Sheep [206 samples]:
    49 tissues at different stages covering nervous and reproductive system, digestive system, immune system,
    endocrine system, muscle, adipose tissue and others are shown in one heatmap, so user can easily find which
    tissue has high abundance.
(3) Cattle [68 samples]:
    We offer eight corresponding tissues (heart, liver, lung, kidney, ear, muscle, adipose and brain) for adult
    cattle and yak, which is convenient for people to compare the differences between two species.
4. Tools

4.1 Local UCSC Genome Browser

User can type chromosome position, or gene symbol, or transcript name in the search box and click the "go" button to visualize ruminant data in the genome browser. Since comparative genomics data is displayed using goat as reference genome and goat gene annotations are not well-developed, we added three tracks of cross-species gene annotation with "Sheep NCBI Genes", "Cattle NCBI Genes" and "Cattle Ensembl Genes" (Fig2, Section 1). This browser also displays an "expreBar" bar graph (Fig2, Section 2) for goat RNA-seq data. A box plot showing the range of expression levels across samples is displayed on the details page when user click the "expreBar" track. Moreover, 65-way Multiz Align (Fig2, Section 4) and six scales measurements of evolutionary conservation (Fig2, Section 3) were also in the genome browser. User can find the differences among species by zooming in the alignments to base level. And the protein encoding sequence of 65-way alignments can be translated into amino acids. The comparative genomics data can be used as guides to help locate protein-coding genes, identify functional elements, reveal similarities and differences among species, and trace the course of evolution, which provides more useful information than a single individual genome. User can click "PDF/PS" item under the "View" menu of navigation bar to generate a high quality image in PostScript or PDF formats.

Figure 2. Main Genome Browser display page on the goat ARS1 assembly, showing NCBI genes, expreBar, 10 conservation score tracks, and 65-way multiple alignment track.

4.2 Local UCSC Table Browser

UCSC Table Browser is a powerful tool for retrieving raw data and performing intersections and unions between data in different tracks. For the basic data queries, user can select clade, genome, assembly, group, track, table, regions of interest, output format and output file name to get query results in a tab-delimited text format or compressed format. While for the advanced queries, user can filter and refine queries, intersect query results from different tables and configure the resulting output. The UCSC Table Browser can retrieve and download all data from tracks of goat genome for other analysis.

4.3 Blat

webBlat is a web-based version of BLAT developed by Jim Kent. User can type a DNA or protein sequence against the goat ARS1 assembly to return a list of links to all genome positions that share 95% or greater identity with the input sequence. Then the alignment regions can be displayed in the genome browser when user click the "browser" link.

4.4 Blast

NCBI's wwwBLAST is also available in our database as an online tool, which can be found under the "Tools" menu. User can enter query sequences of DNA or protein against goat genome database getting a group of high-scoring pairwise alignments.

Download

   Ruminant Genome and Annotation Data
http://animal.nwsuaf.edu.cn/Ruminantia/Download/Genome
   65-way Multiz Align and conservation scores
http://animal.nwsuaf.edu.cn/Ruminantia/Download/65WayAlign
   RNA-seq Data
http://animal.nwsuaf.edu.cn/Ruminantia/Download/RNA-seq