I want to retrieve a sequence for many species from the Nucleotide database in NCBI.
I'm working from the command line, and I need to work out which query will return only the region of DNA I am interested in while filtering out unwanted, noisy sequences.
I am working with a wide range of species (stored as a list of about 2000 TaxIDs): it includes small crustaceans, other invertebrates, some algae and small vertebrates (reptiles, amphibians and a few mammals).
My final goal is to obtain a phylogenetic tree for all or most of the species.
It has been suggested that I use these genes to build the phylogeny:
- CO1 mitochondrial DNA
- 16S rRNA
- 18S rRNA
I want to formulate a specific query that will return exclusively those sequences.
I'm using the GenBank query builder to visually check the accuracy of my search; once I find a good query, I will use it in the API.
So far I came up with the following queries:
- `(COX1[Title] OR CO1[Title]) AND complete[Title]` --> 63/2000 species
- `16S[title] AND complete[title] AND rRNA[title] NOT partial[title]` --> 9/2000 species
- `18S[title] AND complete[title] AND rRNA[title] NOT partial[title]` --> 15/2000 species
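For reference, here is a minimal sketch of how one of these queries could be restricted to a list of TaxIDs and turned into an E-utilities `esearch` request. The helper names `restrict_to_taxids` and `esearch_url` are hypothetical, the two TaxIDs (8355 for Xenopus laevis, 8022 for Oncorhynchus mykiss) are just examples, and the block only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def restrict_to_taxids(query, taxids):
    """Wrap a query so it only matches records under the given TaxIDs.

    Hypothetical helper: [Organism:exp] also matches taxa below each TaxID.
    """
    organisms = " OR ".join(f"txid{t}[Organism:exp]" for t in taxids)
    return f"({query}) AND ({organisms})"


def esearch_url(term, retmax=20):
    """Build an esearch URL against the Nucleotide database (db=nuccore)."""
    params = {"db": "nuccore", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The COX1 query from the list above, limited to two example TaxIDs.
cox1_query = "(COX1[Title] OR CO1[Title]) AND complete[Title]"
term = restrict_to_taxids(cox1_query, [8355, 8022])
url = esearch_url(term)

print(term)
print(url)
```

With about 2000 TaxIDs the combined term gets long, so in practice the list would probably need to be split into batches (or sent via POST); that part is left out of the sketch.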
As you can see, the number of species I get sequences for is very low compared to the initial 2000. I doubt that so few sequences are available (especially for COX1, which is used for barcoding).
Can you help me understand whether my queries are good or not, and if possible suggest a better alternative?
More Info
A subset of 10 of my species of interest:
Rasbora heteromorpha
Elasmopus rapax
Gasterosteus aculeatus
Palaemonetes pugio
Catostomus commersoni
Daphnia magna
Oryzias latipes
Xenopus laevis
Tigriopus japonicus
Oncorhynchus mykiss
Out of these 10, only 2 have sequences available using the COX1 query:
| Name | SeqID |
|---|---|
| Oncorhynchus mykiss | EU186789.1 |
| Xenopus laevis | AB278691.1 |
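A rough sketch of how this kind of coverage could be tallied from per-species `esearch` results, assuming one JSON response per TaxID has already been fetched and parsed. The function name `taxids_with_hits` is hypothetical and the counts and IDs in `example` are placeholders for illustration, not real search results.

```python
def taxids_with_hits(results):
    """Return the TaxIDs whose esearch response reports at least one record.

    `results` maps TaxID -> parsed esearch JSON (retmode=json), where the
    hit count is stored as a string under ["esearchresult"]["count"].
    """
    hits = []
    for taxid, parsed in results.items():
        if int(parsed["esearchresult"]["count"]) > 0:
            hits.append(taxid)
    return hits


# Placeholder responses (made-up counts and IDs, for illustration only).
example = {
    8355: {"esearchresult": {"count": "1", "idlist": ["placeholder-id-1"]}},
    8022: {"esearchresult": {"count": "1", "idlist": ["placeholder-id-2"]}},
    35525: {"esearchresult": {"count": "0", "idlist": []}},
}

covered = taxids_with_hits(example)
print(f"{len(covered)}/{len(example)} species with at least one hit")
```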
But Daphnia magna is one of the most commonly used lab organisms, and I found this paper on the complete mitogenome of a specific strain of D. magna, which implies that the complete mitogenome of the species is already known. That means there must be a way to retrieve all the other species that have a full mitogenome but no mention of the COX1 gene in their GenBank title.

`16S[title] AND complete[title] AND rRNA[title] NOT partial[title]`, across 1430 eukaryotes, 287 bacteria and 178 archaea when using "nucleotide". – terdon Nov 09 '23 at 09:52

`CO1` or `COX1` in human. "Cytochrome c oxidase assembly factor 1" has the symbol "COA1". – terdon Nov 11 '23 at 13:07