Project Documentation & Protocols: Maize Gene Discovery Project: ESTs: FAQs
What does this heading in a ZmDB EST tell me about the entry and the sequencing protocol?
>gi|5555284|gb|AI881235.1|AI881235 606060E11.y1 606 - Ear tissue ...
gi|5555284 = GenBank identifier (GI) number
gb|AI881235 = GenBank accession number
AI881235.1 = first deposit (version 1) of the AI881235 sequence
606060E11.y1 = Stanford identifier: location in the freezer and the sequencing strategy are contained in this identifier
606 = internal project number at Stanford University
060 = number of the 96 well plate within the project
E11 = well location on the 96 well plate
.y1 = reverse sequencing, first load. In capillary sequencing there are often multiple "fake" loads to clean up the sample, so don't be surprised to see a load number such as 5; we actually sequenced only once.
Note that .x1 = forward sequencing direction, first load.
606 - Ear tissue cDNA library from Schmidt lab Zea mays cDNA, mRNA sequence
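The breakdown above can be sketched as a small parser. This is a minimal illustration, not a ZmDB tool: the names `ZmdbHeader`, `HEADER_RE`, and `parse_header` are hypothetical, and the pattern assumes every header follows the layout of the single example shown (three-digit project, three-digit plate, a well in rows A-H, then `.x` or `.y` and a load number).

```python
import re
from typing import NamedTuple


class ZmdbHeader(NamedTuple):
    gi: str           # GenBank identifier number
    accession: str    # GenBank accession number
    version: int      # deposit number of this accession
    project: str      # internal Stanford project number
    plate: str        # 96-well plate number within the project
    well: str         # well location on the plate
    direction: str    # "x" (forward) or "y" (reverse)
    load: int         # load number on the capillary sequencer
    description: str  # remainder of the header line


# Assumed layout, taken from the one example header in this FAQ.
HEADER_RE = re.compile(
    r"^>gi\|(?P<gi>\d+)\|gb\|(?P<acc>[A-Z]+\d+)\.(?P<ver>\d+)\|\S*\s+"
    r"(?P<project>\d{3})(?P<plate>\d{3})(?P<well>[A-H]\d{2})"
    r"\.(?P<dir>[xy])(?P<load>\d+)\s*(?P<desc>.*)$"
)


def parse_header(line: str) -> ZmdbHeader:
    """Split a ZmDB-style EST header into its documented fields."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"header does not match the assumed layout: {line!r}")
    return ZmdbHeader(
        gi=m["gi"],
        accession=m["acc"],
        version=int(m["ver"]),
        project=m["project"],
        plate=m["plate"],
        well=m["well"],
        direction=m["dir"],
        load=int(m["load"]),
        description=m["desc"],
    )


example = ">gi|5555284|gb|AI881235.1|AI881235 606060E11.y1 606 - Ear tissue ..."
print(parse_header(example))
```

Plate numbers are kept as strings so that leading zeros (060) survive.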
What direction did you sequence from?
Short Answer: Almost always we are sequencing from the 5' end for our "x" direction.
The cDNA libraries we sequence were directionally cloned; when this works the 5' end of the cDNA is next to one of the sequence priming sites, and the 3' end of the cDNA is next to the other sequencing primer location. For all projects except 486 (leaf primordia) our initial sequencing reactions on each 96-well plate (the "x" direction) were 5' primed; on about 10% of the plates we also performed the reverse sequencing (the "y" direction). For project 486 we started from the 3' end ("x") and a subset of plates were sequenced from the 5' end ("y").
Tip: if you see >10 Ts at the end of a sequence, that is most likely the poly(A) tail. Some cDNAs from directionally cloned libraries are inserted in the "wrong" orientation.
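The tip above can be turned into a quick check. This is a rough sketch only: the function name `has_trailing_t_run` is hypothetical, and it assumes an uninterrupted run of Ts with no miscalled bases, which real single-pass reads will not always provide.

```python
import re


def has_trailing_t_run(seq: str, min_run: int = 11) -> bool:
    """Return True if the sequence ends in at least `min_run` consecutive Ts.

    The default of 11 corresponds to "more than 10 Ts" in the tip.
    """
    m = re.search(r"T+$", seq.upper())
    return m is not None and len(m.group()) >= min_run


print(has_trailing_t_run("GATTACAGG" + "T" * 15))  # long T run at the end
print(has_trailing_t_run("GATTACAGG" + "T" * 4))   # too short to count
```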
How can I find both the forward and reverse sequence of a particular entry?
If you find an interesting sequence with a "y" designation, you will often (but not always) find an "x" sequence. This will have the exact same Stanford identifier number but ending in .x#. In this particular example you could search for the x sequence by pasting 606060E11.x in this link.
As of Feb. 2002, the majority of "x" designations have a corresponding "y" entry; beginning with library 614 (4 day root), we have attempted sequencing from both the 5' and 3' directions on most plasmids. To check if a particular "x" direction EST has a corresponding "y" entry, paste the sample identifier number 606060E11.y into this link.
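As a small illustration of the naming rule, the search string for the opposite-direction read can be derived from any Stanford identifier. The helper name `mate_search_prefix` is hypothetical; the load number is dropped on purpose, because the two directions need not share the same load number.

```python
def mate_search_prefix(stanford_id: str) -> str:
    """Given e.g. '606060E11.y1', return the search string '606060E11.x'."""
    base, _, suffix = stanford_id.partition(".")
    if not suffix or suffix[0] not in "xy":
        raise ValueError(f"expected an identifier ending in .x# or .y#: {stanford_id!r}")
    other = "y" if suffix[0] == "x" else "x"
    return f"{base}.{other}"


print(mate_search_prefix("606060E11.y1"))  # 606060E11.x
print(mate_search_prefix("606060E11.x1"))  # 606060E11.y
```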
Why would I want to look for the x and y sequences from a given EST? Won't they form a contig that I can find compiled in a table?
Contigs form when the sequences from the 5' and 3' end of the EST overlap in the middle. The longer the cDNA insert, the less likely it is to form a contig.
Our experience thus far is that about half of the items sequenced from the 5' and 3' (x and y) ends form a contig. These contigs vary in length from ~500 - 1200 bases; the average length varies by library but is usually between 900 - 1000 bases. The other half of the x and y sequences on a single cDNA plasmid failed to contig; either one direction failed to yield usable sequence or the insert size is likely to be >1 kb.
Tip: Plasmids with long "x" and "y" sequences that fail to contig in the middle are good candidates to request if you want to obtain more sequence information.
Why don't you sequence from both ends of all ESTs?
Time, money, etc.; there are real limitations. For the first half of the EST project we are reverse sequencing about 10-15% of the 96 well plates. We choose the plates with the highest yield of unique sequences in the forward direction. From the forward and reverse sequence we gain valuable insight into the average insert length for a library, and we maximize the chances of forming long contigs during the final assembly.
What is the future plan of your EST project?
For the second half of our EST project, we plan to perform more reverse sequencing. In addition, we will pick 500 - 1000 unique ESTs that match RescueMu inserts and obtain the full length sequence.
When the EST project is finished and fully analyzed, we plan to select the unique ESTs to produce a preliminary "unigene set" for maize. This collection will be microarrayed onto glass slides; you can purchase these microarrays from the Maize Gene Discovery project. To confirm the identity of items in the unigene set, some or all of these ESTs will be sequenced again from both the forward and reverse directions. Until the "unigene" set is ready in late 2000, we are microarraying the ESTs as we go along. Microarraying is done at the University of Arizona.
Click here to see how to order an EST.
Click here to check on the status of the microarraying project for corn GSTs. We plan that microarrays from our initial libraries will be available in the late fall of 1999. The Maize Gene Discovery project will perform preliminary experiments and controls, and the data will be posted at ZmDB. We will also provide protocols for probe preparation and hybridization. If you are interested in global patterns of gene expression to discover genes up- or down-regulated by a treatment, in various tissues, or in mutant plants, we encourage you to try the microarrays as soon as they are available. The likely cost of ~$150 for two slides (each with 3,000 - 5,000 items in duplicate) will cover our replacement costs. In the standard technique, you compare two RNA samples (control to treatment). Although arrays are about three times more expensive than the first use of a "northern blot" the arrays provide much more information and require much smaller RNA samples.
How are project numbers assigned?
At the Stanford Genome Centers, each new type of project, i.e. a new cDNA library, receives a project number. All materials in that project are bar-coded with the project number and a sequential plate number.
How good is your sample tracking?
Before the MegaBACE capillary sequencers can be loaded, the bar code information must be read into the machine. These procedures facilitate sample tracking. If you request an EST, the datasheet that comes with your sample explains what to do if you receive the wrong sample (this is a rare occurrence).
How do you decide how many ESTs to sequence from a particular library?
Our goal is gene discovery: to find as many different maize genes as possible in our 50,000-EST project. For every 96-well plate we sequence, we evaluate the number of new sequences and the fraction of attempts that were successful. We multiply these two numbers together, with the count of new sequences divided by 100 (e.g., 75 new sequences and 0.80 success give 0.75 × 0.80), to chart the "yield" from a plate (in this example 0.6). Typically 4-6 plates are evaluated from the same library on a given day, and an average "gene discovery" value is calculated for that day. We typically sequence until the "gene discovery yield" approaches 0.3 (e.g., 48 new sequences and 0.70 success give 0.336, which means the end is near for this project). To maintain quality, we continue only if the length of ESTs is close to or exceeds 500 bases and the quality score on the base-calling is high (phred average 35-43).
Sequencing success is a specific characteristic of each cDNA library. Some libraries have some short or junk inserts; these libraries must have a very high fraction of novel sequences for us to continue sequencing. Other libraries show excellent sequencing success, and we can continue gene discovery deeper into the library.
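The yield arithmetic can be sketched as follows. This is an illustration under one stated assumption: the quoted results (0.6 and 0.336) only work out if the count of new sequences is divided by 100 before multiplying, so the `scale` parameter below encodes that reading. The function names and the `STOP_THRESHOLD` constant are hypothetical.

```python
STOP_THRESHOLD = 0.3  # sequencing typically stops as the daily average approaches this


def plate_yield(new_sequences: int, success_fraction: float, scale: int = 100) -> float:
    """Yield for one plate: (new sequences / scale) * fraction of successful attempts.

    scale=100 is an assumption inferred from the worked figures in the text.
    """
    return (new_sequences / scale) * success_fraction


def daily_gene_discovery(plates: list[tuple[int, float]]) -> float:
    """Average yield over the plates evaluated from one library on one day."""
    yields = [plate_yield(n, s) for n, s in plates]
    return sum(yields) / len(yields)


print(round(plate_yield(75, 0.80), 3))  # 0.6
print(round(plate_yield(48, 0.70), 3))  # 0.336

# Hypothetical day with four plates from the same library.
day = [(75, 0.80), (70, 0.78), (66, 0.81), (72, 0.75)]
print(daily_gene_discovery(day) > STOP_THRESHOLD)  # True, so keep sequencing
```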
What if I find some "junk" in the EST database?
Please let us know by e-mail. Click here to e-mail ZmDB manager Qunfeng Dong, and please describe what you found. Examples of "junk" that sometimes creep in include lambda phage and E. coli sequences. We plan to compare EST sequences from normal silks and silks infected with Fusarium to identify some common Fusarium genes. If you have sequences from a common maize pathogen or field microbe, please e-mail us the information and an explanation. It would be helpful to everyone if we can pinpoint sequences that are present in maize cDNA libraries but are actually from microbial fellow travelers.
What are the advantages and disadvantages of capillary sequencing?
For the MegaBACE (Molecular Dynamics) machine, major advantages include easy sample loading (all 96 wells are loaded simultaneously) and a quick run time (<3 hours for 600+ read length). With bar coding, loading and sample tracking problems are minimized as no sample is handled individually.
Disadvantages of capillary sequencing include the increased expense of the machine compared to gel-based sequencing, although this difference is not important if you plan to perform lots of sequencing. We encounter day-to-day problems with sample purity and concentration: too little DNA, too much template DNA, excess salt and dust can all cause failure or result in short sequences of low quality. Some problems can be overcome by performing a "mock loading" or even several mock loadings, before starting the actual sequencing run. Presumably the mock loading procedure removes some of the junk in samples.
Constant Improvement is our Motto. We are always working on protocol improvements to provide samples that are more uniform in DNA concentration and with as little "contamination" as possible. The original members of our sequencing team are Khaled Sarsour, Gurpreet Randhawa, and Brian Nakao. The current members are Brian Nakao, Bret Schneider, and Darren Morrow.
How do you select which sequences are of sufficient quality and length for reporting to GenBank?
Vector sequences are automatically trimmed off the "raw sequence" after base calling is performed. Trimmed sequences shorter than 100 bases are rejected, as are sequences with a phred score <15; in both cases the well is listed as a failure.
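The two rejection rules can be written as a simple filter. This is a sketch only: the text does not say how the per-read phred score is computed, so the code takes it as a single number supplied by the caller, and the names `MIN_TRIMMED_LENGTH`, `MIN_PHRED`, and `passes_filter` are hypothetical.

```python
MIN_TRIMMED_LENGTH = 100  # trimmed reads shorter than this are rejected
MIN_PHRED = 15            # reads scoring below this are rejected


def passes_filter(trimmed_seq: str, phred_score: float) -> bool:
    """Return True if a vector-trimmed read meets both thresholds.

    `phred_score` is assumed to be one summary value per read.
    """
    return len(trimmed_seq) >= MIN_TRIMMED_LENGTH and phred_score >= MIN_PHRED


print(passes_filter("A" * 520, 38.0))  # True: long enough and high quality
print(passes_filter("A" * 80, 38.0))   # False: shorter than 100 bases
print(passes_filter("A" * 520, 12.0))  # False: phred score below 15
```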
Our typical phred scores are ~35 - 43. A phred score of 20 means 1 base calling error is likely in every one hundred bases; a phred score of 30 means 1 base calling error is likely in every one thousand bases; and a phred score of 40 means 1 base calling error in every 10,000 bases. We are performing single pass, single strand sequencing. To ensure that the "quality scores" are realistic, we hand-check some of those ESTs that match existing maize genes by performing a BLAST search against maize genes at GenBank. For ESTs from 100 - 600+ bases, identities are 97-100%. Some of the mismatches are likely to be true polymorphisms, and some are sequencing errors.
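The three examples above follow the standard phred relationship, in which a score Q corresponds to an error probability of 10^(-Q/10). A short sketch (function names are illustrative) reproduces them and estimates the expected number of miscalled bases in a read, assuming every base in the read has the same score:

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with phred score q is wrong."""
    return 10 ** (-q / 10)


def expected_errors(q: float, read_length: int) -> float:
    """Expected miscalls in a read, assuming a uniform score q at every base."""
    return phred_to_error_probability(q) * read_length


for q in (20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"phred {q}: about 1 error in {round(1 / p):,} bases")

# A 600-base read at the low end of the typical range quoted above.
print(f"{expected_errors(35, 600):.2f} expected errors in 600 bases at phred 35")
```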
Where can I learn more about the Stanford Genome Centers and their high throughput procedures?
Look at the Stanford Human Genome Center, the Saccharomyces Genome Project, and the Arabidopsis Database.