USGENE database reload provides updated BLAST and GETSIM versions and new searching capabilities
The patent sequence database USGENE, providing all available peptide and nucleic acid sequences from the published applications and issued patents of the United States Patent and Trademark Office (USPTO), has been reloaded and enhanced on STNext.
In addition to faster search processing, the highlights of the new version of USGENE are:
- New BLAST version and additional BLAST search options
- New FASTA version
- Expanded functionality for BLAST and GETSIM searches
- Better display of search results, new sorting option
- New search fields for the composition of nucleic acid and protein sequences
- Better compatibility with PATGENE and GENESEQ databases
- Better compatibility with full text patent databases
- Maximum number of hits increased
NEW BLAST VERSION AND ADDITIONAL BLAST SEARCH OPTIONS
USGENE now uses BLAST version 2.12.0. Four additional search options have been introduced, allowing for more precision in search results:
- /SQM - the "megaBLAST" algorithm, for searching highly similar nucleotide sequences
- /SQDM - the "discontiguous megaBLAST" algorithm, for searching similar nucleotide sequences but allowing more mismatches
- /TSQP - the BLASTx algorithm, for searching nucleotide sequences translated from PATGENE protein sequences
- /TSQNX - the tBLASTx algorithm, for searching translated nucleotides from PATGENE protein sequences
Additional details on these new search options can be found by typing HELP BLAST or HELP TLATION at an arrow prompt while in USGENE.
NEW FASTA VERSION
The FASTA algorithm, invoked by RUN GETSIM, has been updated to version 36.3.8h. It now allows searching of sequences up to 30K characters in length. The available search options are the same as before: /SQN for searching nucleotide sequences, /SQP for searching amino acid sequences, and /TSQP for translating a nucleotide query in all six reading frames to amino acid sequences and searching the protein sequences. The display of the parameters, the overview diagram and the alignments is now the same for GETSIM and BLAST searches. Updated HELP information is available in HELP GSIM.
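The six-frame translation mentioned for /TSQP can be illustrated outside STN with a short Python sketch. It assumes the standard genetic code (NCBI translation table 1) and a query made up only of A, C, G, T (or U). The function names are hypothetical, and nothing here describes how GETSIM is implemented internally.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; codons not in the table become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> list[str]:
    """Return translations for forward frames 1-3, then reverse frames 1-3."""
    seq = seq.upper().replace("U", "T")
    strands = (seq, reverse_complement(seq))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


if __name__ == "__main__":
    for number, protein in enumerate(six_frame_translation("ATGGCC"), start=1):
        print(number, protein)
```

Each of the six strings is a candidate amino acid query. A translated search compares all of them against the protein sequences, which is why a nucleotide query can find protein hits on either strand.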
EXPANDED FUNCTIONALITY FOR BLAST AND GETSIM SEARCHES
For BLAST and GETSIM searches, answer sets can now be generated not only from the percentage score value of an alignment but also from its percentage identity value, or from a combination of both. See details in HELP BLAST or HELP GSIM.
IMPROVED USABILITY OF MOTIF SEARCHING (RUN GETSEQ) RESULTS
To improve the usability of Motif searching results, the entire answer set is now always included within a single L number. HELP GSEQ has been updated and includes additional information.
BETTER DISPLAY OF SEARCH RESULTS, NEW SORTING OPTION
New displays of similarity results are now available. For each BLAST or GETSIM search, two diagrams are now generated to provide an overview of the similarity between the retrieved sequences and the query:
- the number of answers, and
- a score for the specific degree of similarity for the search
For BLAST and GETSIM searches, an L-number is generated by entering ALL, an absolute score number, a score percentage, an identity percentage or a combination of the two percentages. Each L-number can be used for further processing. While the default search results display is sorted by descending Accession Number, the ability to sort by descending Similarity Score (SORT SCORE D L1) has been retained and the ability to sort by descending Percent Identity (SORT IDENT D L1) has been introduced in USGENE. The capability to sort by descending Percent Identity is now also being introduced in PATGENE and GENESEQ.
Alignments can be displayed for all three RUN options (BLAST, GETSIM, GETSEQ) as text with the display format ALIGN or as an image with ALIGNG.
NEW SEARCH FIELDS FOR THE COMPOSITION OF NUCLEIC ACID AND PROTEIN SEQUENCES
The introduction of new search fields reporting the nucleotide and amino acid composition of a specific sequence makes it possible to refine your searches to find sequences with a particular type of content. The new fields are as follows:
/AA - retrieves amino acid codes expressed as single characters
(see HELP AAC for the definitions of the amino acid codes)
/NA - retrieves the nucleotide codes (see HELP NUC)
/AA.CNT - retrieves the number of amino acids
/NA.CNT - retrieves the number of nucleotides
/AA.PER - retrieves the percentage of amino acids in the sequence
/NA.PER - retrieves the percentage of nucleotides in the sequence
Range searching is possible for the /AA.CNT, /NA.CNT, /AA.PER, and /NA.PER fields. Use the (S) proximity operator for more precise search results.
For example, nucleotides with high GC-content (Guanine, Cytosine) can be retrieved with:
=> S (G OR C)/NA (S) 60-100/NA.PER
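To make the count and percentage fields more concrete, the following Python sketch computes a per-symbol count and percentage for a sequence, which is roughly the kind of value the .CNT and .PER fields report. How USGENE actually derives and indexes these values is not described here, so the sketch assumes plain per-symbol counts over the whole sequence. The function names are hypothetical.

```python
from collections import Counter


def composition(sequence: str) -> dict[str, tuple[int, float]]:
    """Return {symbol: (count, percent)} for every symbol in the sequence."""
    seq = sequence.upper()
    counts = Counter(seq)
    total = len(seq)
    return {symbol: (n, 100.0 * n / total) for symbol, n in counts.items()}


def symbol_in_percent_range(sequence: str, symbol: str, low: float, high: float) -> bool:
    """True if the given symbol makes up between low and high percent (inclusive)."""
    _count, percent = composition(sequence).get(symbol.upper(), (0, 0.0))
    return low <= percent <= high


if __name__ == "__main__":
    query = "GGGGCA"
    for symbol, (count, percent) in sorted(composition(query).items()):
        print(f"{symbol}: count={count}, percent={percent:.1f}")
    # Assumption: a code and its own percentage are evaluated together.
    print(symbol_in_percent_range(query, "G", 60, 100))
```

The check pairs one symbol with its own percentage. This mirrors the assumption that the (S) operator links a code in /NA to the percentage value belonging to that same code.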
BETTER COMPATIBILITY WITH THE PATGENE AND GENESEQ SEQUENCE DATABASES
While USGENE already had the Patent Sequence Location (/PSL) and Sequence Count (/SEQC) fields, their recent addition to PATGENE and GENESEQ means that the same sequence-specific searches can now be performed in all three databases.
For every sequence in USGENE, the SHA-2 algorithm has been applied and the result indexed in the new Sequence Key (/SEQK) field. The generated string (e.g., A0000030BD19782FC1774AF58E4CFFEE7F0E30588CBA14DCD38C) is specific to a sequence. Identical sequences receive the same string, regardless of the database of origin or the organism from which the sequence was isolated. Further details on using the /SEQK field for efficient duplicate identification will be communicated in due course.
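The idea behind a hash-based sequence key can be sketched in Python with the standard hashlib module. This is an illustration only. The SHA-2 variant, the normalization of the sequence, and the encoding of the final string used for /SEQK are not specified here, so the sketch assumes SHA-256 over the upper-cased sequence with whitespace removed. Its output will therefore not match actual /SEQK values, and the function name is hypothetical.

```python
import hashlib


def sequence_key(sequence: str) -> str:
    """Return a hex digest that is identical for identical sequences.

    Assumptions (not taken from USGENE): whitespace is ignored, case is
    ignored, and SHA-256 is used as the SHA-2 family member.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest().upper()


if __name__ == "__main__":
    first = sequence_key("atggcc ttgacc")
    second = sequence_key("ATGGCCTTGACC")
    print(first == second)  # same sequence, same key
    print(first == sequence_key("ATGGCCTTGACA"))  # one residue differs
```

Because the key depends only on the sequence characters, two records holding the same sequence get the same key whatever their source. That property is what makes a field like this useful for finding duplicates across databases.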
COMPATIBILITY WITH FULL TEXT PATENT DATABASES
Search fields common to the patent full text databases are now also available in USGENE:
/APO Application Number, Original
/DED Data Entry Date
/DUPD Data Update Date
/INA Inventor Address
/PAA Patent Assignee Address
/PNO Publication Number, Original
/PRYF Priority Year, First
/PRNO Priority Number, Original
/RLPC Related Publication Country
/RLPD Related Publication Date
/RLPN Related Publication Number
/RLPY Related Publication Year
/RLT Related Application Type
MAXIMUM NUMBER OF HITS INCREASED
The default maximum number of hits has been increased to 15,000.
The new parameter "-maxseq" allows the maximum number of hits to be increased to 100,000, but a larger maximum will mean a longer processing time. Example of setting maxseq to 100,000:
=> RUN BLAST L1/SQN -F F -MAXSEQ 100000
Although BATCH searches are not possible, L-numbers from sequence searches can be saved with the command SAVE and reactivated with ACTIVATE.
Alerts for sequences are not possible for the time being but can be set up for bibliographic fields.
The new Database Summary Sheet for USGENE is available at: