Title

Documentation

General

The rapid advancement of next-generation sequencing technology has generated a deluge of genomic data from domesticated goats and their wild ancestor, bezoars, which have simultaneously broadened our understanding of domestication and subsequent intensive trait-driven breeding. To address the scarcity of SNPs and indels data provided by authorized databases and to make these data more easily and friendly usable and available, we developed a comprehensive Goat Genome Variation Database (GGVD) that provides five main functionalities: Variation Search, Genomic Signature Search, Genome Browser, Alignment Search Tools (BLAT/BLAST) and Genome Coordinate Conversion Tool (LiftOver). The GGVD provides genomic variations comprising ~41.44 M SNPs, ~5.14 M indels, and signatures of selective sweeps in 232 worldwide goats. Users can quickly retrieve an interactive table and a visual map that project the allele frequency distribution pattern in 20 geographically distributed goat populations and eight subgroups according to PCA and NJ tree. The signals of selection are displayed in the common formats of Manhattan plots, line charts, and Genome Browser tracks, and p-values are provided for each window. User can compare the selection differences between different groups by any combination. Genome Browser integrate all variations and selection data for global presentation. Collectively, all these features make the GGVD a useful archive for in-depth analysis in goat biology and goat breeding.

Related articles:

1. Zheng Z., Wang X., Li M., Li Y., ... Chen Y & Jiang Y. (2020) The origin of domestication genes in goats. ence Advances 6, eaaz5216.
2. Cai Y., Fu W., Cai D., Heller R., Zheng Z., ... Jiang Y & Wang X. (2020) Ancient genomes reveal the evolutionary history and origin of cashmere producing goats in China. Molecular biology and evolution, msaa103, https://doi.org/10.1093/molbev/msaa103.

Methods

I. Pipeline of SNPs and indels calling
  1. All raw sequencing data were obtained from Sequence Read Archive (SRA) of NCBI. We collected a total of 232 worldwide goat samples. The genome resequencing achieved an average depth of ~14X.
  2. All cleaned reads were mapped to the latest goat reference genome ARS1 (GCF_001704415.1) using BWA-MEM v0.7.15 with default parameters. Duplicate reads were removed using Picard v2.1 (http://broadinstitute.github.io/picard/). Then, both Genome Analysis Toolkit (GATK) HaplotypeCaller (version 3.7) with default parameter and Analysis of next generation Sequencing Data (ANGSD, version 0.918) with "-only_proper_pairs 1 -uniqueOnly 1 -remove_bads 1 -minQ 20 -minMapQ 30 -C 50 - doMaf 1 -doMajorMinor 1 -GL 1 -setMinDepth Total_read_depth/3 -setMaxDepth Total_read_depth*3 -doCounts 1 -dosnpstat 1 -SNP_pval 1" were implemented to detect the SNPs in all samples. Only biallelic sites detected with both methods with a quality score greater than 50 and no more than 10% missing data were used.
  3. For indel calling, we first sifted structural variations for each sample by GATK with the SelectVariants-based method. Then, we applied the hard filter command "VariantFiltration" to exclude potential false-positive variant calls with the parameter –filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < −20.0 || InbreedingCoeff < −0.8". Finally, we only retained the 1–30 bp indels for downstream analysis.
  4. A total of ~41.44 million autosomal SNPs and ~5.14 million autosomal indels were identified.
  5. Beagle software was used to phase the identified SNPs in goat.
  6. Annotation of SNPs and indels was carried out by using snpEff.
  7. Minor allele frequencies (MAF) for all goats, and allele frequencies for each goat group were calculated with PLINK and VCFtools.
II. Population structure
  1. The PCA of the SNPs was performed using the smartpca programme in EIGENSOFT v5.0. The Tracy-Widom test was used to determine the significance level of the eigenvectors. For phylogenetic analysis, a neighbour-joining tree was constructed with RapidNJ with 100 bootstraps.
  2. Combining our previous result, eight geographically distributed genetic components can be roughly ascribed to: Bezoars, African goats, African dairy goats, European goats, Australian goats, Southwest Asian goats, South Asian goats, and East Asian goats.
III. Selection evaluation
  1. GGVD provides nucleotide diversity (Pi), heterozygosity (Hp), Tajima's D, composite likelihood ratio (CLR), integrated haplotype score (iHS), Weir and Cockerham’s Fst (FST), cross-population extended haplotype homozygosity (XP-EHH), and differences in nucleotide diversity (Pi ratio) for 10 goat groups. To facilitate the identification of true selective signatures, we set a cutoff corresponding to Z test P < 0.005 if the data conforms to the normal distribution or use top 1% signal value when the data does not satisfy the normal distribution.
IV. liftOver chain file
  1. We aligned different versions of the goat genome separately by minimap2, producing six pairwise alignment results.
  2. Two utilities, maf-convert and axtChain, were used to produce liftOver chain files to facilitate users to coordinate conversion.
V. Database implementation
  1. High-quality SNPs, indels, selection scores and their corresponding annotations, classification and threshold value, were processed with Perl scripts and stored in the MySQL database.
  2. We use PHP Server Pages, HTML5 and JavaScript to implement search, data visualization and download.

Manual

I. Samples and population structure

Our database integrates resequencing data of 232 goat samples from published genetic works. The data set contains six geographic groups: 24 Bezoars, 70 African goats, 31 European goats, 5 Australian goats, 36 Southwest Asian goats, 9 South Asian goats, and 57 East Asian goats. A whole-genome neighbor-joining phylogenetic tree demonstrated a clear genetic structure with samples from each geographical region clustering together except African goats. Due to African dairy goats were introduced from Europe, so they clustered with European goats to form one branch. According to genetic relationship, all the samples can be roughly ascribed to: Bezoars, African goats, African dairy goats, European goats, Australian goats, Southwest Asian goats, South Asian goats, East Asian goats, and Tibetan goats.

 

Fig 1. Geographic distribution, population genetics, and data processing pipeline of 232 goat individuals.

 
II. Variation search

The GGVD allows users to obtain information of SNPs and indels by searching for a specific variant rsid, a gene symbol or a genomic region. Users can filter SNPs and indels further by "Advanced Search", in which some parameters, such as minor allele frequency and consequence type, can be set; this option enables users to narrow down the items of interest in an efficient and intuitive manner. The results are presented in an interactive table and graph. Based on the returned results, users can obtain related details including variant position, alleles, minor allele frequency, variant effect, rs id and the allele frequency distribution pattern in 20 geographically distributed goat populations or eight subgroups according to genetic structure.

  SNPs or indels search

Fig 2. Screenshots of a SNP data search and the results for examples.

III. Signature search

For basic search, users can select a specific gene symbol or genomic region, one of the statistical methods (Pi, Hp, Tajima's D, CLR, iHS, FST, Pi ratio, XP-EHH), and a specific goat group to view the selection scores. The results are retrieved in a tabular format. When users click the "show" button on the table, selective signals are displayed in Manhattan plots, where the target region or gene is highlighted in red colour. For advanced search, users can select groups and methods by any combination to view the selection scores in line charts, which is usually used as a partial enlarged view of selective sweep. In our database, the selection scores are pre-processed by several algorithms, such as Z-transform, logarithm, and p values are calculated according to data distribution mode.

 
  1. Signature basic search

Fig 3. Screenshots of a search for genomic selection data and representative tracks of Gbrowse.

  2. Signature advanced search

Fig 4. Selective signal advanced search and screenshots of selective sweep around FGF5 gene in different groups (chr6: 95.17-96.03 Mb).

IV. GGVD tools

1. Local UCSC genome browser

Users can search with a gene symbol, or a transcript name, or a genomic region to view SNPs, indels, genomic signature, and conserved elements in the global view. Currently, 56 tracks have been released for the goat ARS1 assembly. The "PDF/PS" item under the "View" menu of navigation bar was used to generate a high quality image in PostScript or PDF formats.

2. Alignment search tools (BLAT/BLAST)

We introduced two sequence alignment tools, webBlat and NCBI wwwBLAST. The webBlat can be used to quickly search for homologous regions of a DNA or mRNA sequence, which can then be displayed in the browser. BLAST can find regions of local similarity between sequences, which can be used to infer functional and evolutionary relationships between sequences.

3. Genome coordinate conversion tool (liftOver)

We also introduced a genome coordinate conversion Tool, liftOver. The liftOver tool is used to translate genomic coordinates from one assembly version into another and also retrieves putative orthologous regions in other species. Our database produces six liftOver chain files (ASM_vs_CHIR_1.0.minimap2.chain.gz, ASM_vs_CHIR_2.0.minimap2.chain.gz, CHIR_1.0_vs_ASM.minimap2.chain.gz, CHIR_1.0_vs_CHIR_2.0.minimap2.chain.gz, CHIR_2.0_vs_ASM.minimap2.chain.gz, and CHIR_2.0_vs_CHIR_1.0.minimap2.chain.gz) and provides an online lift for any two versions of the goat genome.

Project organizers

Yu Jiang

Northwest A&F University, Yangling, Shaanxi, China

Email: yu.jiang@nwafu.edu.cn