
01. Whole Genome Synteny

We built whole-genome alignments for 67 species (55 ruminants and 12 mammalian outgroup species) with LAST v88555 and MULTIZ, using the goat assembly (ARS1) as the reference.

step 1: lastdb -uNEAR -cR11 ref.fa.db ref.fa
step 2: lastal -P20 -m100 -E0.05 ref.fa.db query.fa | last-split -m1 > query.maf
step 3: maf-swap query.maf | last-split | maf-swap | last-split | maf-sort > query.LAST.maf
step 4: perl maf.rename.species.S.pl query.LAST.maf refSpeciesName querySpeciesName query.Final.maf > query.stat
step 5: multiz query1.Final.maf query2.Final.maf 0 all > query1.query2.multiz.maf

Get the Perl script: maf.rename.species.S.pl
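
The pairwise steps above are repeated for every query species against the goat (ARS1) reference before the MULTIZ merge. A minimal driver sketch, assuming a hypothetical genomes/<species>.fa layout and using "goat" as an illustrative reference species name:

    # Run steps 1-4 for each query genome against the goat reference (sketch only;
    # genomes/<species>.fa and the species name "goat" are illustrative).
    lastdb -uNEAR -cR11 ref.fa.db ref.fa                                                      # step 1: index the reference once
    for fa in genomes/*.fa; do
        sp=$(basename "$fa" .fa)
        lastal -P20 -m100 -E0.05 ref.fa.db "$fa" | last-split -m1 > "$sp.maf"                 # step 2
        maf-swap "$sp.maf" | last-split | maf-swap | last-split | maf-sort > "$sp.LAST.maf"   # step 3
        perl maf.rename.species.S.pl "$sp.LAST.maf" goat "$sp" "$sp.Final.maf" > "$sp.stat"   # step 4
    done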

02. Gene Annotation Improvement

We aligned the goat genome (ARS1) to the sheep, cattle, and human genomes to generate pairwise alignments using LAST v88555. The gene coordinates of these three species were then converted to goat genome coordinates with the LiftOver tool to improve the goat gene annotation.

step 1: lastdb -uNEAR -cR11 ref.fa.db ref.fa
step 2: lastal -P20 -m100 -E0.05 ref.fa.db query.fa | last-split -m1 > query.maf
step 3: maf-swap query.maf | last-split | maf-swap | last-split | maf-sort > query.LAST.maf
step 4: maf-convert psl query.LAST.maf > query.psl
step 5: axtChain -linearGap=loose -psl query.psl ref.2bit query.2bit queryToRef.chain
step 6: liftOver -minMatch=0.9 -genePred ref.gene.genePred queryToRef.chain query.gene.map query.gene.unMapped (for sheep and cattle)
        liftOver -minMatch=0.2 -minBlocks=0.5 -genePred ref.gene.genePred queryToRef.chain query.gene.map query.gene.unMapped (for human)
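
Steps 5 and 6 expect 2bit genome files and a genePred gene file that are not produced by the steps above. A minimal preparation sketch, assuming the standard UCSC utilities faToTwoBit and gtfToGenePred and a GTF source annotation (file names are illustrative and follow the commands above):

    # Prepare the liftOver inputs (sketch; assumes UCSC faToTwoBit/gtfToGenePred are installed).
    faToTwoBit ref.fa ref.2bit                        # 2bit file for the LAST reference genome
    faToTwoBit query.fa query.2bit                    # 2bit file for the query genome
    gtfToGenePred ref.gene.gtf ref.gene.genePred      # gene models to lift, converted from a GTF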

03. GO/Pathway Annotation

A synteny-based alignment approach was applied to identify the orthologous genes in each ruminant genome. The consensus mRNA and coding sequences of each species were extracted based on the goat and cattle annotation files (this step requires a MySQL database). Gene Ontology (GO) terms, KEGG pathways, and WikiPathways annotations of cattle (which has well-curated gene annotation) were then reciprocally mapped back to the orthologous genes to obtain GO and pathway annotations for the ruminant orthologs.

step 1: mafGene Goat 67speciesMultiz -useFile genePred species.list 67species.orthologous.mRNA.output -noTrans
step 2: mafGene Goat 67speciesMultiz -useFile genePred species.list 67species.orthologous.protein.output
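
With the orthologous gene sets in hand, the cattle annotations can be transferred by a simple ID join. A minimal sketch, assuming hypothetical two-column tables orthologs.tsv (cattle gene, orthologous gene) and cattle.GO.tsv (cattle gene, GO term); the same join applies to the KEGG and WikiPathways tables:

    # Replace cattle gene IDs with their orthologs (sketch; file names are hypothetical).
    awk 'BEGIN{FS=OFS="\t"}
         NR==FNR {map[$1]=$2; next}        # load cattle gene -> ortholog mapping
         ($1 in map) {print map[$1], $2}   # emit ortholog ID with the cattle GO term
        ' orthologs.tsv cattle.GO.tsv > ortholog.GO.tsv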

04. Conservation Scores

Both phastCons and phyloP conservation scores were computed separately for three groups of species: Bovidae (39 species), Pecora (54 species), and Ruminantia (55 species).

[ generate non-conserved and conserved models ]

step 1: phyloFit --tree tree.nwk --msa-format FASTA --out-root nonconserved-4d.mod 4DTV.fasta
step 2: msa_split chr.maf --in-format MAF --windows 100000,0 --out-root split_maf/chr --out-format SS --min-informative 1000 --between-blocks 5000
step 3: phastCons --estimate-rho window.ss --no-post-probs window.ss nonconserved-4d.mod.mod
step 4: phyloBoot --read-mods window.cons.list --output-average Average.chr.cons.mod
        window.cons.list: 6.10000001-10100000.ss.cons.mod,6.1000001-1100000.ss.cons.mod,6.100001-200000.ss.cons.mod,6.100099918-100200000.ss.cons.mod,6.100200001-100300000.ss.cons.mod ...
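
Step 3 is run once per window produced by step 2, and the resulting per-window .cons.mod files are averaged in step 4. A minimal loop sketch, assuming the split_maf/ output root used in step 2:

    # Estimate rho for every window and build the comma-separated model list for phyloBoot
    # (sketch; paths follow the --out-root used in step 2).
    for ss in split_maf/*.ss; do
        phastCons --estimate-rho "$ss" --no-post-probs "$ss" nonconserved-4d.mod.mod
    done
    ls split_maf/*.cons.mod | paste -sd, - > window.cons.list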

[ phastCons ]

phastCons --most-conserved phastCons.chr.bed --score chr.maf Average.chr.cons.mod,nonconserved-4d.mod.mod > phastCons.chr.wig

[ phyloP ]

phyloP --wig-scores --mode CONACC --method LRT nonconserved-4d.mod.mod chr.maf > phyloP.chr.wig
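
For browser tracks or downstream tools, the per-chromosome wiggle outputs can be converted to bigWig. A minimal sketch, assuming the UCSC wigToBigWig and faSize utilities (not part of the pipeline above) and an illustrative ref.chrom.sizes file:

    # Convert the wiggle scores to bigWig (sketch; wigToBigWig/faSize are assumed to be installed).
    faSize -detailed ref.fa > ref.chrom.sizes
    wigToBigWig phastCons.chr.wig ref.chrom.sizes phastCons.chr.bw
    wigToBigWig phyloP.chr.wig ref.chrom.sizes phyloP.chr.bw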

05. RNA-seq Analysis Pipeline

Pipeline: RNA-seq data processing with a reference genome

step 1: [Quality_control]
        Single-read: java -Xmx30g -jar trimmomatic-0.36.jar SE -phred33 -threads 10 sample.fq.gz sample.clean.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40 TOPHRED33 > sample.log
        Paired-end: java -Xmx30g -jar trimmomatic-0.36.jar PE -phred33 -threads 10 sample_1.fq.gz sample_2.fq.gz sample_1.clean.fq.gz sample_1.unpaired.fq.gz sample_2.clean.fq.gz sample_2.unpaired.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40 TOPHRED33 > sample.log
step 2: [STAR_index]
        STAR --runMode genomeGenerate --genomeDir index --genomeFastaFiles ref.fa --runThreadN 24 --sjdbGTFfile ref.gtf --sjdbOverhang 149
step 3: [Hisat2_index]
        hisat2-build ref.fa ref.hisat2.index
step 4: [Splice_sites]
        python3.5 hisat2_extract_splice_sites.py ref.gtf > ref.hisat.splicing.site
step 5: [Mapping]
        STAR --genomeDir $INDEX --readFilesIn $IN/${i}_1.clean.fq.gz $IN/${i}_2.clean.fq.gz --readFilesCommand zcat --runThreadN 8 --outFilterMultimapNmax 1 --outFilterIntronMotifs RemoveNoncanonicalUnannotated --outFilterMismatchNmax 10 --outSAMstrandField intronMotif --outSJfilterReads Unique --outSAMtype BAM Unsorted --outReadsUnmapped Fastx --outFileNamePrefix $OUT/${i}
        hisat2 --dta-cufflinks --no-mixed --no-discordant -p 10 --known-splicesite-infile ref.hisat.splicing.site -x ref.hisat2.index -1 $OUT/${i}Unmapped.out.mate1 -2 $OUT/${i}Unmapped.out.mate2 -S $OUT/${i}.sam
        java -Djava.io.tmpdir=$TMP -jar picard.jar CleanSam I=$OUT/${i}Aligned.out.bam O=$OUT/${i}.STAR.bam
        java -Djava.io.tmpdir=$TMP -jar picard.jar CleanSam I=$OUT/${i}.sam O=$OUT/${i}.hisat.bam
        java -Djava.io.tmpdir=$TMP -jar picard.jar MergeSamFiles I=$OUT/${i}.STAR.bam I=$OUT/${i}.hisat.bam SORT_ORDER=coordinate O=$BAM/${i}.bam
step 6: [Assemble_transcripts]
        stringtie sample.bam -G ref.gff -c 3 -j 2 -f 0.05 -o sample.gtf -p 16
        stringtie --merge -G ref.gff -o merge.gff allSample.gtf.list -g 100 -m 200 -c 0.8
        stringtie -e -B -p 8 -G merge.gff -o sample.gtf sample.bam
step 7: [Quantification]
        FPKM (Fragments Per Kilobase of transcript per Million mapped reads) values of the assembled transcripts and genes in each sample were computed with Ballgown (version 2.2.0), which reads the per-sample ballgown tables written by the stringtie -e -B run in step 6 (see the sketch below).
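
Ballgown expects one directory of tables per sample, which the stringtie -e -B run produces next to its output GTF. A minimal per-sample layout sketch, assuming a hypothetical samples.list file (one sample ID per line) and the $BAM directory from step 5:

    # Write one ballgown/<sample>/ directory per sample for Ballgown to read
    # (sketch; samples.list and the ballgown/ layout are illustrative).
    while read -r i; do
        mkdir -p "ballgown/${i}"
        stringtie -e -B -p 8 -G merge.gff -o "ballgown/${i}/${i}.gtf" "$BAM/${i}.bam"
    done < samples.list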

06. Epigenome Analysis Pipeline

Pipeline: ChIP-seq data processing

step 1: [BWA mapping]
        Index: bwa index ref.fa
        Single-read: bwa mem ref.fa -t 8 -M -R '@RG\tID:sample\tLB:sample\tPL:ILLUMINA\tSM:sample' sample.fastq | samtools view -bS -@ 8 > sample.bam
        Paired-end: bwa mem ref.fa -t 8 -M -R '@RG\tID:sample\tLB:sample\tPL:ILLUMINA\tSM:sample' sample_1.fastq sample_2.fastq | samtools view -bS -@ 8 > sample.bam
step 2: [Merge, sort, and filter bam]
        java -Djava.io.tmpdir=/tmp/ -jar picard.jar MergeSamFiles I=sample_line1.bam I=sample_line2.bam I=sample_line3.bam SORT_ORDER=coordinate O=sample.bam
        samtools sort -@ 8 -o sample.sorted.bam -T sample.sorted.tmp -O bam sample.bam
        samtools view -q 20 sample.sorted.bam -b -o sample.sorted.unique.bam
        samtools index sample.sorted.unique.bam
step 3: [Predict fragment size]
        python2.7 macs2 predictd -i sample.sorted.unique.bam -f BAM -g genomeSize --outdir sample/
step 4: [Call peak]
        python2.7 macs2 callpeak -t sample.sorted.unique.bam -c control.sorted.unique.bam --extsize fragmentSize --nomodel -f BAM -g genomeSize -n sample -p 1e-5 -B --outdir sample_pe-5/
step 5: [LiftOver]
        liftOver sample.narrowPeak refToGoat.chain sample.goatRef.narrowPeak sample.unmap -minMatch=0.2
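
Steps 3 and 4 can be chained so that the fragment size predicted by macs2 predictd is passed straight to macs2 callpeak. A minimal sketch; the plain macs2 invocation assumes it is on the PATH, and the log pattern "predicted fragment length is N bps" is an assumption about the MACS2 log format that should be checked against the installed version:

    # Capture the predicted fragment length (step 3) and feed it to peak calling (step 4).
    frag=$(macs2 predictd -i sample.sorted.unique.bam -f BAM -g genomeSize --outdir sample/ 2>&1 \
           | sed -n 's/.*predicted fragment length is \([0-9][0-9]*\) bps.*/\1/p')
    macs2 callpeak -t sample.sorted.unique.bam -c control.sorted.unique.bam --extsize "$frag" \
          --nomodel -f BAM -g genomeSize -n sample -p 1e-5 -B --outdir sample_pe-5/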

Pipeline: ATAC-seq data processing

step 1: [Mapping]
        bowtie2-build ref.fa ref.bowtie2.index
        bowtie2 -x ref.bowtie2.index -X 2000 -1 sample_1.fq.gz -2 sample_2.fq.gz -S sample.sam
        samtools view -bS sample.sam > sample.bam
step 2: [Sort, filter, and remove duplicates and mitochondrial reads from bam]
        samtools sort -@ 8 -o sample.sorted.bam -O bam sample.bam
        samtools view -q 20 sample.sorted.bam -b -o sample.sorted.unique.bam
        java -Xmx50g -Djava.io.tmpdir=tmp -jar picard.jar MarkDuplicates INPUT=sample.sorted.unique.bam OUTPUT=sample.sorted.unique.dedup.bam METRICS_FILE=sample_dedup REMOVE_DUPLICATES=true CREATE_INDEX=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT MAX_FILE_HANDLES=2000
        samtools view -h sample.sorted.unique.dedup.bam |awk '{if($3 != "Mt_name"){print $0}}' |samtools view -Sb - > sample.sorted.unique.dedup.noMT.bam
step 3: [Call peak]
        python2.7 macs2 callpeak -t sample.sorted.unique.dedup.noMT.bam --nomodel --shift -37 --extsize 73 -f BAM -g genomeSize -n sample -q 0.05 --outdir sample_q0.05/ --broad
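
The ATAC-seq steps above can be chained per sample, piping the mapping output straight into sorting. A minimal end-to-end sketch for one sample; the sample ID, genomeSize, and the mitochondrial contig name Mt_name are placeholders, as in the commands above:

    # Run the ATAC-seq pipeline end to end for one sample (sketch; placeholders as above).
    s=sample
    bowtie2 -x ref.bowtie2.index -X 2000 -1 ${s}_1.fq.gz -2 ${s}_2.fq.gz \
        | samtools view -bS - \
        | samtools sort -@ 8 -O bam -o ${s}.sorted.bam -
    samtools view -q 20 -b ${s}.sorted.bam -o ${s}.sorted.unique.bam
    java -jar picard.jar MarkDuplicates INPUT=${s}.sorted.unique.bam OUTPUT=${s}.sorted.unique.dedup.bam \
         METRICS_FILE=${s}_dedup REMOVE_DUPLICATES=true CREATE_INDEX=true ASSUME_SORTED=true \
         VALIDATION_STRINGENCY=LENIENT
    samtools view -h ${s}.sorted.unique.dedup.bam | awk '$3 != "Mt_name"' \
        | samtools view -Sb - > ${s}.sorted.unique.dedup.noMT.bam
    macs2 callpeak -t ${s}.sorted.unique.dedup.noMT.bam --nomodel --shift -37 --extsize 73 \
          -f BAM -g genomeSize -n ${s} -q 0.05 --outdir ${s}_q0.05/ --broad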