‘galign’ SNP and
deletion search tools (for MacOSX Intel)
The advent
of high throughput sequencing has allowed researchers to generate copious
amounts of sequence data, raising the possibility that whole genome sequencing
may become a viable method for cloning genes. One major current sequencing technology, Solexa, can
generate roughly forty million 32-36 bp sequence reads in a single run. For organisms such as C. elegans or Drosophila, this amounts to about 10x
coverage of the genome, sufficient, in principle, to guarantee that nearly
every nucleotide in the genome is sampled at least once. In reality, the vagaries of sequencing
reactions and the inability to assign reads from repetitive sequences to
specific genomic locations are likely to limit coverage to around 98-99% of the
genome.
The suite
of programs described here is intended to identify single nucleotide
polymorphisms (SNPs) or other alterations with respect to a reference genome
sequence using sequence reads of 27-39 bp obtained using Solexa sequencing.
The files
described here can be downloaded at http://shahamlab.rockefeller.edu/galign/galign0.8.zip. A beta version allowing
processing of reads upto 78 nucleotides long can be downloaded at http://shahamlab.rockefeller.edu/galign/galign0.9beta.zip. For this beta
version, convert Fastq files to 'galign' reads using Format_convert2. For alignment use Alignment_tool2.
Source codes are available at http://shahamlab.rockefeller.edu/galign/Source_codes. A paper describing 'galign' in more
detail has been submitted and is available as a preprint upon request
(shaham@rockefeller.edu).
To install, download the galign.zip file, expand the file (this is
often done automatically), and place the ‘galign’ folder on your Desktop. The ‘galign’ folder should contain the
following directory tree (Figure 1):
Program
descriptions
Following
are descriptions of each program, followed by a step-by-step protocol for
performing a search for C. elegans SNPs. These
programs have been extensively tested on a Mac OSX Version 10.6.4 2.66
Ghz Quad-Core Intel Xeon machine, with 8 GB 1066 Mhz DDR3 RAM. However, it should be straight forward to recompile them from the source code for other machines.
Alignment_tool
The alignment
tool uses two text files provided by the user to count the number of wild-type
and mutant reads at every base in the genome.
Genome
sequence file: This is a text file containing the
sequence of the genome to be evaluated in lower-case text. The file is obtained using the
Genome_assemble application (see below) and is placed in the ‘Genome_sequences’
folder in the 'galign' folder on the Desktop. The file contains a concatenated version of the genome,
starting with chromosome I with no other text besides the raw sequence. The
genome sequence file should be given a unique name. A file containing the C. elegans genome sequence (Wormbase release
195) is already placed in the ‘Genome_sequences’ folder.
Sequence
reads file: This is a text file containing the
sequence reads obtained from the Solexa runs in lower-case text. The file should be placed in the
‘Sequence_reads’ folder in the 'galign' folder on the Desktop. Since the final
several bases of Solexa sequence reads are often less reliable, reads of
32-nucleotides should be used. The
file should contain only sequence information and no other text. The file should be given a unique name. Each 32-nucleotide sequence read should
be placed on a separate line. A
tool for converting raw output from Solexa reads to the format specified here
is supplied (see ‘Format_convert’ below).
Alignment
method:
Parsing
the genome: Alignment begins with the genome sequence being read in groups
of 13 consecutive nucleotides starting at one end of the genome sequence, and
advancing one nucleotide at a time.
Each genome position is assigned to a number array (the genome position
array) whose element number is the genomic position, and whose value is a
unique numerical descriptor of the corresponding 13-nt sequence. The descriptor is a base 4
representation of the 13-mer generated by the formula , where a(i) takes on the values 0, 1, 2, or 3
if the nucleotide at position i reads G, A, T, or
C,
respectively. Next, a set of 413
number arrays (pointer arrays) are generated. Each pointer array corresponds to a possible base 4 value of
a 13-nt sequence, and the number of elements in each array corresponds to the
number of occurrences of a given 13-nt sequence in the genome. The value of each array element is the
genomic position of each 13-nt sequence.
Aligning
a wild-type sequence read: Once the genome sequence has been parsed,
'galign' loads the first sequence read from the sequence-reads file into memory
and divides it into 3 segments, A, B, and C,
of respective lengths 13, 13, and x, with 1 ≤ x ≤ 13. Generally,
sequences obtained from Illumina/Solexa reads are not accurate beyond 32
nucleotides, and the program currently uses x=6 as a default setting. Format_convert automatically generates
sequence reads of 32 nts. 'galign'
then checks whether sequence A exists in the genome, by referring to the pointer array
corresponding to sequence A. If the
sequence is found, the first genomic position of A is noted, and the program asks
whether sequences B and C
exist at positions 13 and 26 nts downstream of the genomic position of A, respectively, by referring to the
pointer arrays. If fragment C is shorter than 13 nts, as is
generally the case, the appropriate length genomic sequence is extracted from
the genome position array value.
If matches are found, a value of 1 is added to a new number array (the
wild-type array) whose element numbers correspond to positions in the genome,
and whose values represent the number of times that position has been read as
wild type.
Aligning
a sequence read with one nucleotide substitution: Single substitutions
could occur in either fragment A, B, or C. If a match to the genome with fragment A is not detected, 'galign' attempts
to match sequences B and C
next to each other in the genome.
If this is successful, the program revisits sequence A and examines whether genomic
sequences 13 nts upstream of sequence B can be matched to sequence A with one substitution. Single substitutions are detected by
subtracting the base 4 representation of the genomic sequence from that of the
sequence read and only allowing a change of a multiple of a single power of 4
(see above).
If sequence A is matched in the genome but B is not, 'galign' checks to see if
fragment C can
be exactly aligned. If C matches, sequence B is tested for having a single
substitution as above. A similar
procedure is followed for putative substitutions in fragment C.
Finally, if
single substitution matches are not found, 'galign' queries deeper elements of
the pointer arrays and repeats the process for other genomic sites containing
the A or B fragments. If no matches are detected at these
positions, 'galign' generates the reverse complement of the input fragment, and
repeats the above algorithm. If an
alignment of a single-mismatched read is made, a value of 1 is added to each
element of the wild-type array in which a sequence match was detected. A value of 1 is also added to an
element of a new number array (the mutant array) corresponding to the position
of the mismatched residue in the genome.
In addition, the nucleotide sequence of a mutant 13-nt fragment,
starting at the mutation site, and of the corresponding wild-type sequence,
deduced from the genomic sequence, are added to two new character arrays (the
mutant and wild-type sequence arrays, respectively) whose elements correspond
to the position of the mutation site in the genome.
Output: After scanning through all sequence
reads, the alignment tool generates a file containing the collected
information. The file begins at
position 1 of the first chromosome, displaying the value of the wild-type
number array for that position, followed by the value of the mutant number
array corresponding to that position.
If the mutant number array value is not 0, the wild-type and the last
mutant sequence stored in the respective sequence arrays are displayed.
Memory: For a genome of 100 Mbps, 'galign'
requires 4*100,000,000=400,000,000 bytes for the genome array, and about
400,000,000 bytes for the pointer arrays, wild-type, and mutant number
arrays. Memory allocation for the
sequence strings is usually relatively small, unless the number of mutations is
of the order of the number of nucleotides in the genome. Thus, roughly 1.6 Gb of memory needs to
be allocated.
Nucleotide
substitution data can be extracted from the alignment output file using the
SNP_search tool. The output of
SNP_search is a text file containing an ordered list of predicted nucleotide
substitutions characterized by their positions with respect to a reference exon
and with respect to the chromosome sequence, the predicted sequence alteration,
and, in the case of lesions in coding regions, the predicted amino-acid
substitution. The numbers of
wild-type and mutant reads giving rise to the prediction are also displayed
(Figure 2).
To perform its tasks, SNP_search also requires five
accessory files: a file containing a list of exon names, exon start and end
positions, chromosomal locations and strand information; two similar files for
introns and intergenic regions, referenced by the upstream exon in the case of
the intron file, and by the downstream exon for the intergenic region file; a
file containing a list of exon names, the positions of the start and end of
coding regions in each exon, and the reading frame (1, 2, or 3) at the 5' most
nucleotide of the exon (unless it is the first exon of a gene, in which case
the reading frame is assigned the value 1); and a file containing a list of
exon names and their sequences.
Accessory files for the C. elegans genome are present in the 'Feature_locations'
folder.
Before executing SNP_search, the user is instructed to
input three parameters restricting SNP detection: the minimum ratio of the
number of mutant to wild-type reads required for calling a SNP, the minimum
number of mutant reads to call a SNP, and whether SNPs should be scanned in
exons, introns, or intergenic regions.
SNP_search then scans the alignment output file, noting all positions
satisfying the criteria set by the user-provided parameters. Flagged positions are then queried as
to whether they fall within exons, introns, or intergenic regions, as indicated
by the user, by comparing SNP position in the genome with the list of
exon/intron/intergenic region start and end positions. SNP_search broadens the definition of
an exon to include 10 nucleotides upstream and downstream of an exon, to
facilitate identification of splice-site mutants.
For exonic
domains, SNP_search next assesses whether the SNP falls inside a coding
region. If so, the amino-acid
affected by the SNP is recovered by subtracting the SNP position from the exon
start position, calculating modulo 3 of the resulting value, and assessing the
frame with respect to the known frame at the exon start site. The appropriate codon is then retrieved
from the exon sequence files either directly, or by stitching sequences from
adjacent exons in the event that a SNP affects either the first two or the last
two nucleotides of an exon. Both
wild-type and mutant amino acids are then printed to the output file. For introns and intergenic regions,
SNP_search then outputs a file similar to that presented in Figure 2, except
that the column displaying codon changes is absent.
Output:
Output of SNP_search is a text file located in the ‘SNP_results’ folder, with the name ‘yourname_snp_cexons.txt’. For intron/intergenic searches, ‘cexons’ is replaced by ‘cintrons’/’cintergenic’. For exons with strands ‘1’ or’-1’, positions are relative to the 5’ or 3’ end of the exon.
Deletions
greater than one nucleotide: The Deletion_search tool of ‘galign’
identifies all positions in the genome that have no sequence reads, wild-type
or mutant, as defined by the alignment tool output file. In addition to the output file of the
alignment program, Deletion_search also requires 3 accessory files containing
the names, start and end positions, chromosome, and strand locations of exons,
introns, and intergenic regions. For
the C. elegans genome, these are located in the 'Feature_locations'
folder. The program searches
through the alignment output file for an occurrence of 0 wild-type and 0 mutant
reads. Once such a position is
encountered, a running counter measures the length of the unread region, until
a position with non-zero reads is reached. The start and end point of each unread region is then
compared to exon/intron/intergenic region start and end sites to determine
whether the region spans the user-indicated feature set. The output file produces an ordered
list, based on chromosome position, of all genomic locations that lack hits
(Figure 3).
Long stretches of
unread genomic regions arising from sampling error of the genome are predicted
to be exceedingly rare for a standard sequencing experiment. Thus, large stretches of sequence are
likely to arise from three possible situations: First, since the ‘galign’
alignment tool assigns all repetitive sequences to the first occurrence of the
sequence in the genome, large unread stretches could correspond to repetitive
domains of the genome. Second, DNA
regions whose amplification in sequencing reactions is not favored may not be
represented in the sequence reads.
Third, true deletions in the sequenced genome, with respect to the
reference genome, may account for large unread domains.
Distinguishing
among these options is not always possible. However, if coverage of the genome is extensive, runs of
unread regions corresponding to true deletions should be flanked by single or,
more rarely, more than one position containing at least one mutant read. This occurs because 32-nt sequence
fragments that span the deletion site will be read as a mismatch by the
alignment tool, unless the nucleotides at the positions flanking the deletion
site are identical to the deleted nucleotides abutting the deletion site. Deletion_search flags sequences
containing such flanking regions in the output file under the
"Comments" heading.
The
Deletion_search tool will tend to overestimate deletion sizes. Each genomic position can, in
principle, be covered by 32 different 32-nt sequence reads. Thus, the probability of reading a
given position in a standard sequencing experiment is theoretically 99.9997%. However, using the ‘galign’ alignment
tool, genomic positions directly abutting deletion sites can only be covered by
two sequence reads (and much more rarely by a larger number), reducing the
chance of detection. Overall,
therefore, the vast majority of deletions greater than one nucleotide are
predicted to be found by ‘galign’.
Single
nucleotide deletions: Unlike
large deletions, only 45% of single nucleotide deletions will be detected by
the Deletion_search algorithm.
However, of the remaining 55%, 53% will be detected as substitutions by
SNP_search. Overall, therefore, it
is estimated that 98% of single-nucleotide deletions will be detected by
‘galign’.
Insertions: Detection of insertions by ‘galign’
follows similar considerations to deletion detection, however, unlike
deletions, whose size can be estimated, insertion size is either difficult or
impossible to extract from the ‘galign’ alignment output. As with single nucleotide deletions,
insertion sites can be detected either by producing nucleotide substitution
calls by SNP_search, or by presenting as unread genome positions using
Deletion_search. Some insertions
will, assuming full coverage, have a unique signature of two adjacent positions
containing mutant reads. Although
such dyads can arise from independent events, they may be an indication of
insertion sites, and are displayed in the "Comments" column of the
Deletion_search algorithm output.
Multiple
local substitutions: Although
the presence of two or more nucleotide substitutions within a 32 nt region
could be revealed by SNP_search, these regions may also be detected by
Deletion_search, since most sequence reads spanning these areas will contain
more than one mismatch, and will not be read by the alignment tool, giving rise
to unread positions.
Distinguishing such instances from true deletions may not be possible in
‘galign’.
Output:
The output of ‘Deletion_search’ is a text file named ‘yourfirstname_del_cexons(or cintergenic or cintrons).txt’ located in the ‘Deletion_results’ folder in the ‘galign’ folder. The output consists of 7 columns: running index, exon name, chromosome number, strand, position of 5’ end of deletion, position of 3’ end of deletion, deletion length. Negative numbers are upstream of the exon, positive numbers are downstream. For exons with strands ‘1’ or’-1’, positions are relative to the 5’ or 3’ end of the exon.
The
‘simple_SNP’ tool processes files from the Alignment_result folder to reveal positions
in the genome that conform to a user-provided SNP criterion, or that have no reads. The tool does not use information
from the annotated genome, and does not determine whether displayed positions map in exons, introns,
or intergenic regions.
Output:
The output of ‘simple_SNP’ is a text file named ‘yourfirstname_simpleSNP.txt’ located in the ‘SNP_results’ folder within the ‘galign’ folder.
The
‘Format_convert’ tool converts three common Solexa output files (gseq, fastq,
fasta) to a file compatible with the ‘galign’ tool. The file to be converted must be placed on the Desktop prior
to using the tool.
Output:
The output of ‘Format_convert’ is a text file named ‘yourfirstname_reads.txt’ located in the ‘Sequence_reads’ folder within the ‘galign’ folder. If conversion was successful, the output file should contain a list of 32-nucleotide sequences in lower case letters.
Genome_assemble
The
'Genome_assemble' tool allows chromosome sequences from a genome to be uploaded
separately or from a single file.
These sequences are then converted to the 'galign' format. Chromsome files should be in Fasta format.
Output:
The output
of 'Genome_assemble' is a genome file deposited in the 'Genome_sequences'
folder with the name given by the user.
Step-by-step
procedure for identifying SNPs and other alterations in C. elegans
For more
in-depth information, read the sections above.
1. The first step in identifying SNPs
and other alterations in a mutant C. elegans strain is to subject DNA from that
strain to Solexa sequencing. 10x
coverage is the minimum recommended coverage to theoretically guarantee that every
nucleotide gets read at least once.
This corresponds to about 34,000,000 32-nucleotide reads.
2. Next, the sequencing facility
analyzes data collected by the machine, and determines the quality of each
sequence read. Generally you will
be provided with one or more of the following output files:
gseq:
qseq files will have many lines, each of which contains (separated by spaces)
the machine name, run number, lane number, tile number, x coordinate, y
coordinate, indexing value, read number, sequence, quality score, and quality
filter pass status.
Fastq:
Fastq files will have the read ID on one line, the sequence on the next line,
the ID again on the next line, and the quality score on the next line. This is repeated for all reads.
Fasta:
Fasta files will have the read ID on one line, followed by the sequence on the
following line. This is repeated
for all reads.
If you have files in any of these formats you can use the ‘Format_convert’ tool in ‘galign’ to convert them to a file compatible with the ‘galign’ program. To do so, the original sequence file must be placed on the Desktop.
If you do not have a file in one of these formats, you will need to generate a text-only file containing 32-nucleotide reads in lower-case letters, arranged one per line, with no additional text.
To use ‘Format_convert’, double-click on the ‘Format_convert’ icon. This will launch a ‘Terminal’ window. Follow the instructions and enter all the information. The output file will be placed automatically in the ‘Sequence_reads’ folder of ‘galign’.
3. Double-click on the ‘Alignment_tool’ icon inside the ‘galign’ folder to run the alignment program. This will launch a ‘Terminal’ window. Enter all the requested information. The genome sequence file is ‘elegansgenome195.txt’. This program takes about 1-hr to run through 40,000,000 32-nucleotide Solexa reads.
4. To identify single nucleotide substitutions, double-click on the ‘SNP_search’ icon. This will launch a ‘Terminal’ window. Enter all the requested information- the Alignment_results file has the name ‘yourfirstname_galign.txt’ and is located in the ‘Alignment_result’ folder. Typically this program runs for about a minute. The output file can be found in the ‘SNP_results’ folder, with the name ‘yourname_snp_(cexons or cintrons or cintergenic).txt’. The details of the output are elaborated on in the description of the ‘SNP_search’ tool above. If you are interested in all substitutions in the genome, regardless of whether they are in exons, introns, or intergenic regions, you will need to run this program once for each region.
5. To identify putative deletions and other alterations, double-click on the ‘Deletion_search’ icon. This will launch a ‘Terminal’ window. Enter all the requested information- the Alignment_results file has the name ‘yourfirstname_galign.txt’ and is located in the ‘Alignment_result’ folder. Typically this program takes about a minute to run. The output file can be found in the ‘Deletion_results’ folder, with the name ‘yourname_del_(cexons or cintrons or cintergenic).txt’. The details of the output, and what lesions they might represent, if any, are elaborated on in the description of the ‘Deletion_search’ tool above. If you are interested in all putative deletions in the genome, regardless of whether they are in exons, introns, or intergenic regions, you will need to run this program once for each region.
FEEDBACK ON
ANY OF THE PROGRAMS OR EXPLANATORY MATERIAL IS WELCOME.