100

In Hofstadter's Gödel, Escher, Bach: An Eternal Golden Braid (GEB), the following claim appears:

...in the species Felis catus, deep probing has revealed that it is indeed possible to read the phenotype directly off the genotype. The reader will perhaps better appreciate this remarkable fact after directly examining the following typical section of the DNA of Felis catus:

...CATCATCATCATCATCATCAT...(OP note: truncated because, you get it)

Is this true? A cursory search for the DNA of Felis catus gives me this 1996 paper by Lopez, Cevario, and O'Brien and the given sequence does not appear – there are some instances of "CAT" but not repeated enough to make it as remarkable as claimed in GEB.

I don't know enough Biology to judge the veracity of this claim. Some points I am considering are:

  • GEB is full of wordplay. However, the tone of this part of the text does not sound like wordplay to me.
  • GEB was written/published around 1978. The paper I linked to – which was cited by some 236 others according to Google – was published in 1996, well after GEB's time. If my impression is correct that Lopez et al.'s work is significant because it is the first time Felis catus was sequenced, then there is no way Hofstadter could've known of it when he wrote GEB. Then again, I don't know enough Biology to rule out some nuance in Lopez et al.'s paper that I'm missing (i.e., the results of the paper might not be mutually exclusive with the claim made in GEB).
  • GEB has reference notes and a bibliography, and no reference is cited to back this claim. However, GEB does not attempt to be a rigorous academic thesis: the reference notes are mostly used when Hofstadter quotes other works directly, while the bibliography is a list of readings that the reader may want to check out regarding the main thesis of the book.

So are cats recursions with no base cases?

skytreader
  • 14
    Welcome to BiologySE! You came with a bang! ;) – Failed Scientist Oct 10 '16 at 02:32
  • 1
    As a rule: when you find amazing scientifically impressive postulates in GEB, you better make sure whether you detected them in one of the dialogs rather than the main material. Hofstadter's dialog characters tend to take a considerable amount of poetic liberty with the underlying science. I don't know the section you found this in, but the statement about the phenotype being deducible from the genotype very much sounds like having more tongue in cheek than an average cat could hope for. –  Oct 11 '16 at 18:39
  • 13
    Does Lady Gaga's DNA have GAGAGA? – djechlin Oct 12 '16 at 18:36
  • 2
    While I don't know the cat genome, I'm pretty sure the dog genome doesn't contain DOG. – user137 Oct 14 '16 at 14:37
  • 1
    I'm voting to close this question as off-topic because, amusing though it may be, the relationship between the English word cat and the abbreviations used for molecules can in no respect be regarded as being about biology. Perhaps it has something to do with statistics or English literature. – David Apr 30 '17 at 14:28
  • I’m voting to close this question because this is trivial and stupid. – David Jan 25 '21 at 22:57
  • I’m voting to close this question because as @David has pointed out in the past this question is trivial and has no biological relevance. – tyersome Mar 07 '21 at 21:08
  • 2
    @tyersome Yet, it has been upvoted at least 92 times and favorited 22 times. – Rodrigo de Azevedo Mar 07 '21 at 21:11
  • I’m voting to close this question because it is of no scientific interest whatsoever. – David Sep 15 '22 at 16:53
  • Interesting. I now see both @tyersome and I voted to close this previously. Why were our close votes removed? – David Sep 15 '22 at 16:55
  • 1
    @David Because no one else supported them, your votes aged away. They age away after 14 days. – Mesentery Sep 18 '22 at 05:27
  • @user237650 — Thanks for the info. Well, at least the comments don't. – David Sep 18 '22 at 08:43

4 Answers

102

The Felis catus genome has been published, annotated, and updated quite a bit since 1996, including spans of so-called intergenic regions, which are basically scaffolding and other structures, along with perhaps some unidentified genes, pseudogenes, regulatory sequences, etc. Pretty much the entire DNA sequence is available now, not just the sequence of the mitochondrial genome, which was what was published in the 1996 paper you referenced. Mitochondria are the power plants of the cell, but they are just organelles that happen to contain their own DNA, which is separate from the chromosomal DNA in the nucleus. All of this is available for free (if you know where to look) at the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM) at the National Institutes of Health (NIH) in the United States. Other sites are also available, such as Ensembl, a joint project between the European Bioinformatics Institute (EMBL-EBI), part of the European Molecular Biology Laboratory (EMBL), and the Wellcome Trust Sanger Institute (WTSI). Both institutes are located on the Wellcome Trust Genome Campus in the United Kingdom.

So, to the genome. Genomic sequences can be searched in a couple of different ways, depending on what you're looking for, but the most common way is to use BLAST, the Basic Local Alignment Search Tool. As the name implies, it takes sequences as input and searches one against the other, aligning the results as best as possible using certain algorithms that the user can define and tweak. The BLAST web interface to the cat genome is here. You don't need to worry about any of the other options here except the "Enter Query Sequence" box. FASTA format, for our purposes, is just the single-letter abbreviations for nucleotides (AGCT), all strung together.
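If you'd rather build the query text than type it by hand, here is a minimal Python sketch. The function names `tandem_query` and `to_fasta` and the record name `ten_cats` are made up for illustration and are not part of BLAST or any NCBI tool. It also writes the optional `>` header line that a full FASTA record normally starts with; the query box accepts the bare sequence too.

```python
def tandem_query(motif: str, copies: int) -> str:
    """Return the motif repeated `copies` times, e.g. CATCATCAT for ("CAT", 3)."""
    return motif.upper() * copies


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record: a '>' header line, then the bases.

    `width` is the number of bases per line; 60 is a common convention,
    not a requirement.
    """
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{name}"] + lines)


if __name__ == "__main__":
    # Hypothetical record name; paste the printed text into the query box.
    print(to_fasta("ten_cats", tandem_query("CAT", 10)))
```

Changing the second argument of `tandem_query` gives the 4-, 8-, and 12-CAT queries used below.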

The genome we're searching is of an Abyssinian cat named Cinnamon:

Cinnamon

Cinnamon, the cat which was chosen to be the definitive genetic model for all cats in the feline genome project. Image courtesy of the College of Veterinary Medicine at the University of Missouri.

To start with, I typed in CATCATCATCAT and to my surprise got back over 200 hits, covering every chromosome the cat has. So, I doubled the length of the input to 8 CATs, and got back the same result set. Unfortunately, 12 CATs was too many (and really, it is too many), so I worked backwards.

The final results are here (sorry, link expires 10/13/16. To regenerate, go to BLAST link above and enter CATCATCATCATCATCATCATCATCATCAT). Apparently, popular wisdom is incorrect, and Felis catus chromosomes really contain 10 CATs each, one more than is needed for their 9 lives. No word yet as to why this may be, but scientists are presumably working on it.
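If you download a chromosome as FASTA and want to check the longest run yourself rather than rely on BLAST, something like the following Python sketch would do it. This is not what BLAST does internally; it is a plain exact-match scan, and the function names are my own. It also checks the reverse complement, since a run of ATG on one strand is a run of CAT on the other.

```python
import re


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_tandem_run(seq: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `seq`."""
    # The lookahead lets a run start at every position, so runs that begin
    # inside an earlier match are not missed.
    pattern = re.compile(f"(?=((?:{re.escape(motif.upper())})+))")
    runs = (len(m.group(1)) // len(motif) for m in pattern.finditer(seq.upper()))
    return max(runs, default=0)


def longest_run_either_strand(seq: str, motif: str) -> int:
    """Longest tandem run of `motif` on the given strand or its complement."""
    return max(
        longest_tandem_run(seq, motif),
        longest_tandem_run(reverse_complement(seq), motif),
    )


if __name__ == "__main__":
    toy = "GGCATCATCATTT"  # toy sequence, not real cat DNA
    print(longest_tandem_run(toy, "CAT"))
```

For a whole chromosome you would read the sequence from the FASTA file first (skipping the `>` header line) and pass it in as one string.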

MattDMo
  • 11
    The irony might be taken seriously, I don't think it is a good place for joking. – alephreish Oct 09 '16 at 14:58
  • 25
    @har-wradim what is ironic here? The last sentence? No harm there as I'm not interested in deep research on cats, per se. I find the answer well-detailed and, limited as my biology knowledge is, Matt's explanation adds up, reproducible and verifiable. Well, NCBI Blast's UI isn't Apple-quality but it holds up as far as I can interpret it. – skytreader Oct 09 '16 at 15:10
  • 3
    My question is, is felis catus the only species for which this is true? I would suppose otherwise. – John Dvorak Oct 09 '16 at 17:09
  • 9
    @Jan: I would say it's highly unlikely. This is just pattern matching with an astronomical input set. – Lightness Races in Orbit Oct 09 '16 at 17:41
  • @JanDvorak can you come up with an animal spelled with A, T, C, and G? – svavil Oct 09 '16 at 22:10
  • 26
    THE MOAR YOU KNOW: Rumor has it that some well known pop singer has lots of "GAGA" in her DNA. In other news: every other carbon-based lifeform does too. – hmijail Oct 09 '16 at 23:07
  • 7
    I heard that the entire cast of Gattaca had GATTACA sequences in their DNA. Most of the crew did too. I smell a conspiracy. I think they even passed a law about large numbers to cover it up. – candied_orange Oct 10 '16 at 03:38
  • 41
    My background is in mathematics; I joined the Biology StackExchange in order to leave this comment. The cat genome, like our own, is about 3 billion base pairs long. The probability of matching a sequence of n base pairs starting at any given position is 1 in 4^n (since there are 4 possible base pairs), which for n=12 is about 1 in 16 million. That means you'd expect to find around 200 matches for CATCATCATCAT if all short sequences were equally likely. This won't be perfectly true, but as March points out the widespread existence of tandem repeats makes matches like this still more likely. – Robin Saunders Oct 10 '16 at 03:53
  • Reading many of the comments on this page, I get the feeling that while some of us here are having fun, others, less informed, are completely confused. My vote goes to @March Ho. – alephreish Oct 10 '16 at 07:15
  • @har-wradim what exactly does that mean? – MattDMo Oct 10 '16 at 12:25
  • 4
    @har-wradim never mind, I see. Some people have completely missed the joke, and are taking this way too seriously. – MattDMo Oct 10 '16 at 12:28
  • 2
    @skytreader "NCBI Blast's UI isn't Apple-quality but ..." meaning that it cannot search apples' genomes? or.. ? – Danilo Ramirez Oct 11 '16 at 13:55
  • @DaniloRamirez Apple as in where Macbooks come from. A lot of scientific tools, heck even industry products, have UIs that aren't Apple-quality but are nonetheless useful and powerful. – skytreader Oct 11 '16 at 15:09
  • 2
    I very much appreciate the humor at the end there :) – L.B. Oct 12 '16 at 15:59
  • @skytreader Oh that Apple.... please don't... :) the pun was intended from the initial comment, but thanks for keeping it very explanatory – Danilo Ramirez Oct 13 '16 at 18:04
  • @MattDMo So do I understand it correctly, could there be some other cat different from Cinnamon that can have more CATCATCAT... than her? Because we only analysed one genotype and those of other organisms of the same species can differ, right? – nuoritoveri Jan 27 '17 at 10:10
  • 1
    @nuoritoveri Yes, that is possible. – MattDMo Jan 29 '17 at 19:45
  • I think this is the longest one: https://www.ncbi.nlm.nih.gov/nuccore/NC_018733.3?report=fasta&from=72130713&to=72130839 $(CAT)_{41}$ (reverse complemented). Perhaps you forgot to uncheck 'mask low complexity regions'? – reuns Jan 27 '21 at 10:12
66

While Matt's answer is perfectly correct, it is important to note that the sequence $(CAT)_n$ in DNA is not restricted to cats, and you would expect to find it anywhere.

For example, searching the human genome for the same 3-tandem repeat CAT sequence results in many hits as well.

This is because you are essentially searching for short tandem repeats on the DNA strand. These repeats can occur in any organism, and therefore while finding CAT substrings in the DNA of the cat may be amusing, they aren't special to cats (or any other animal) and are only the result of an artifact of naming of the bases coincidentally matching with the name of the animal.

March Ho
  • The bases are not just 'named', they represent the four nitrogenous bases: adenine, cytosine, guanine and thymine. – SummerEla Oct 09 '16 at 22:27
  • 27
    @SummerEla While you are right, I don't see how it's inaccurate to call that "naming". – March Ho Oct 09 '16 at 22:30
  • Well, it's more like an acronym than a naming system: those three nucleotides together (called a codon) ultimately work together in strings of codons to code for a specific protein. – SummerEla Oct 09 '16 at 22:48
  • 17
    @SummerEla If the bases were called adenine, bytosine, cuanine and dymine, then you would have BADBADBAD. If they were called qurine, quadrium, quitterium and quinterone, then you'd have QQQQQQQQQ. And so on. By renaming them, you can make up any short word containing only four different letters, and find it in whichever chromosome of whichever animal you want - for example you could make the human Y chromosome contains "MENS". – user253751 Oct 10 '16 at 02:50
  • 1
    @immibis what?

    My point was that the bases are not named arbitrarily, they actually stand for nucleotides that comprise amino acids, which when combined make up proteins.

    – SummerEla Oct 10 '16 at 03:11
  • 26
    But the naming of the nucleotides themselves is ultimately arbitrary. According to http://etymonline.com, adenine is "so called because it was derived from the pancreas of an ox", while guanine is named "from guano, from which the chemical first was isolated" and thymine "from thymic acid, from which it was isolated" (cytosine comes from cyto- meaning "cell"). Had the discoveries occurred differently, those chemicals would have very different names. – Robin Saunders Oct 10 '16 at 04:03
  • 1
    HAHAHA (no, not a remapped/renamed nucleotide sequence--really laughing). For some reason, this makes the GEB claim funnier. Thanks for the clarification! – skytreader Oct 10 '16 at 04:45
  • @SummerEla Well yes, when the sequence "cytosine, adenine, thymine" encodes a particular thing. So does the sequence "quadrium, qurine, quinterone" because those are actually the same sequence, I'm just using different names to refer to the same bases. – user253751 Oct 10 '16 at 08:07
  • @SummerEla So does "thymine, adenine, cytosine" in an alternate timeline where the word "thymine" refers to the base with one ring and an NH2 subgroup, and "cytosine" refers to the base with one ring and no NH2 subgroup. – user253751 Oct 10 '16 at 08:07
  • 2
    @SummerEla Also, I'm pretty sure nucleotides don't comprise amino acids, only code for them. – user253751 Oct 10 '16 at 08:08
  • So as per Matt's answer (that there is a 10x sequence), can that 10x sequence be found in any other organisms? – Doktor J Oct 12 '16 at 16:09
16

To augment the other answers, let's compute the expected number of times CATCATCATCAT occurs in a random DNA sequence, treating every base as independent and equally likely.

Cat DNA length is 2.7 gigabases (source), and there are 4 possible bases. For 1 CAT there are 3 bases, giving an expected number of occurrences in 2.7 Gb of $\frac{2.7 \cdot 10^9}{4^3} \approx 42\,188\,000$

Repeating the calculation for longer sequences gives:

  • 1 CAT: 42 188 000 occurrences
  • 2 CAT: 659 180 occurrences
  • 3 CAT: 10 300 occurrences
  • 4 CAT: 160 occurrences
  • 5 CAT: 2 occurrences
  • 6 CAT: 0 occurrences

So, indeed, there are many more CATs in cats than could be expected by pure chance alone.
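The table can be reproduced with a few lines of Python. This is a sketch under the same simplifying assumptions as the calculation above: bases are independent and uniformly distributed, and edge effects at the ends of the sequence are ignored. The names `GENOME_LENGTH` and `expected_occurrences` are mine.

```python
GENOME_LENGTH = 2.7e9  # cat genome size in bases, as quoted above
NUM_BASES = 4          # A, C, G, T


def expected_occurrences(copies: int, motif_length: int = 3,
                         genome_length: float = GENOME_LENGTH) -> float:
    """Expected number of exact matches of a motif repeated `copies` times.

    Assumes every base is independent and equally likely, which real DNA
    is not; the result is only a baseline for comparison.
    """
    return genome_length / NUM_BASES ** (copies * motif_length)


if __name__ == "__main__":
    for copies in range(1, 11):
        print(f"{copies:2d} CAT: {expected_occurrences(copies):.3g}")
```

The loop prints unrounded expectations, so the values for the longer repeats show up as fractions rather than as whole occurrences.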

jpa
  • 18
    It wouldn't be too surprising if repeated sequences were more likely to occur than most sequences. – user253751 Oct 10 '16 at 08:09
  • 14
    DNA is not such a simple random sequence and, in particular, repeats occur more often than chance would predict. This is therefore not a good approach. – Jack Aidley Oct 10 '16 at 10:55
  • 4
    @JackAidley In my opinion, this is a good approach to demonstrate exactly that repeats occur more often than they would in a random sequence. – jpa Oct 10 '16 at 10:56
  • 5
    @jpa: It does. But there's nothing special about the sequence 'CAT' in the cat genome. It's a general property of repeats. Perhaps you could edit your answer to make clear the point you're making, and why? – Jack Aidley Oct 10 '16 at 11:12
  • Interpreting the expected number of occurrences as a Poisson parameter, you can interpret the occurrences of 6 CATs as a probability (through the transform $\lambda\mapsto1-\exp(-\lambda)$) of about 4% that you would have that many in a random sequence. As jpa points out, this is a good argument that STRs like CATCAT... are more likely than chance alone would suggest. – Charles Oct 10 '16 at 14:26
  • 3
    Rather, it is an argument that DNA sequences aren't random in the way this calculation supposes. – reinierpost Oct 12 '16 at 10:01
  • @reinierpost it's random from the CAT point of view. The meaning we attribute to CAT is arbitrary to DNA. It's as random as the index into the number Pi that you'd have to start at to find a cat video. – candied_orange Oct 14 '16 at 11:52
  • @CandiedOrange: That is not what this answer means by 'random'. It assumes the C, A, T and G elements are completely random in the sense that the probability of one of them appearing at a specific spot in the sequence is completely independent of what the surrounding elements in the sequence are - and that isn't the case. – reinierpost Oct 14 '16 at 11:57
  • @reinierpost those two ideas of random are the same idea of random is my point. – candied_orange Oct 14 '16 at 12:09
10

So, there are a few great answers here already, but it seems nobody addressed an interesting part of your question: GEB was published in 1979 and the genome of Felis catus was not sequenced until many years later... so how did he know?

jpa's answer shows that you'd expect to get only about five CATs - not ten, and the chance of getting ten is astronomically low. I expanded his table to show the depressingly low chance of getting ten by perfect randomness:

5 CAT: 2.5 expected per Felis catus genome
6 CAT: 0.04 expected
7 CAT: 0.00061
8 CAT: 9.6 e-6
9 CAT: 1.5 e-7
10 CAT: 2.3 e-9

That means you'd expect to find 10 CATs about 0.0000000023 times per random genome. So how on earth did the Felis catus genome end up with ten CATs in it? And how did Hofstadter know that there would be this many CATs?
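To turn an expected count into a probability, one option (suggested in a comment elsewhere on this page) is to treat the number of matches as Poisson-distributed, so that the chance of at least one match is $1 - e^{-\lambda}$. Here is a small Python sketch of that conversion; the Poisson model is an assumption, not something the genome guarantees, and `prob_at_least_one` is my own name.

```python
import math


def prob_at_least_one(expected_count: float) -> float:
    """P(X >= 1) for X ~ Poisson(expected_count), i.e. 1 - exp(-lambda)."""
    # expm1 keeps precision when the expected count is tiny.
    return -math.expm1(-expected_count)


if __name__ == "__main__":
    # Expected counts from the table above (random, uniform bases assumed).
    for copies, lam in [(6, 0.04), (10, 2.3e-9)]:
        print(f"{copies} CAT: P(at least one) = {prob_at_least_one(lam):.3g}")
```

For expected counts this small the probability is practically the same number as the expectation, so the table can be read either way.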

As it turns out, this repeated sequence of a few base pairs is called a "short tandem repeat", or "microsatellite". This is when a 2-5 base pair sequence is repeated several times, usually between 5 and 50 times.

So at this point, to recap: we know that getting this 10 CAT sequence is somewhat more probable than pure chance suggests, but since we're restricted to just the Felis catus genome we definitely aren't guaranteed a 10xCAT sequence. So how did Hofstadter state it as if it was a fact?

As it turns out, one critical property of STRs, or short tandem repeats, is that mutations in these areas are far more common, and they represent a large amount of the genetic variation between individual members of a species. This discovery was made with the advent of DNA sequencing, which began only a few years before the book was published. Therefore, given a large population of nonidentical cats (which we have), we can confidently say that there is an extremely high chance for a 10xCAT sequence.

Hofstadter's genius perfectly combined math (only about 2.3e-9 expected sequences per genome) with biology (microsatellites increase the chance of finding this sequence) with forensic genetics (in a population of the same species, individuals are likely to have many STR-related differences). All of this put together gave Hofstadter what he needed to confidently say: yes, CATCATCATCATCATCATCATCATCATCAT almost certainly exists in the Felis catus DNA. Little things like this are why Gödel, Escher, Bach is my favorite book of all time.

Owen Versteeg