
BBC News recently published an article saying that:

An image and short film has been encoded in DNA, using the units of inheritance as a medium for storing information ... The team sequenced the bacterial DNA to retrieve the gif and the image, verifying that the microbes had indeed incorporated the data as intended.

This is the image:

The news article shows an image of a hand (shown above) and a short film (not shown here) of a horse rider that was encoded into the DNA "using a genome editing tool known as Crispr [sic]".

My question is, what does this mean? Did the scientists break down an image into 0's and 1's and (install?) it into bacteria? How does a scientist (download?) an image into bacteria and then (redownload?) the image later? How does DNA hold information of a picture that can be (downloaded)?

another 'Homo sapien'
PiratePi
  • 7
    I'm just going to migrate this to [biology.se], I believe you'll get a better answer there. By the way - the BBC article links to the Nature journal article in which this work was published. That's the first place you should start trying to read from (although I wouldn't blame you if you didn't understand it). – orthocresol Jul 13 '17 at 14:32
  • 2
    It's refreshing to see the actual CRISPR part of the CRISPR-Cas system being used. – canadianer Jul 13 '17 at 17:56
  • 11
    "Did the scientists break down an image into 0's and 1's" Digital images are already 0s and 1s. No need to "break down" anything. – Lightness Races in Orbit Jul 13 '17 at 23:09
  • 2
    Just an off-topic note: given the "short movie of a riding horse", I think it's probably the first movie made in history, "Race Horse", which actually was just several pictures strung together. https://movies.stackexchange.com/a/42182/20039 – Zaibis Jul 14 '17 at 12:07

3 Answers

26

The image was not in the DNA as such, only as an abstract representation that could be converted into an image from knowledge of the code. Briefly, they encoded the image into DNA, using a couple of different strategies in which DNA represented pixels -- either with a single DNA base representing a pixel, or with a triplet representing a pixel. Knowing the code they used, they could then extract the information and turn it back into an image.

Quoting from the original article, CRISPR–Cas encoding of a digital movie into the genomes of a population of living bacteria:

We began with an image and stored pixel values in a nucleotide code ... We first encoded images of a human hand using two different pixel-value-encoding strategies: a rigid strategy, in which 4 pixel colours were each specified by a different base; and a flexible strategy, in which 21 possible pixel colours were specified by a degenerate nucleotide triplet table ... To distribute the information across multiple protospacers, we gave each protospacer a barcode that defined which pixel set (denoted as ‘pixet’) was encoded by the nucleotides in that spacer. Four nucleotides define each pixet, and the pixels of a given pixet are distributed across the image ...

Their 21-color strategy is outlined in this figure:

[Figure: the 21-color degenerate nucleotide triplet table]

Note: The paper isn't open-access. If you want a full-access version, Church often puts freely accessible versions of his papers on his web site; this paper, #441 on his list, is still shown as "in press" there, but check back at intervals and it may become available.

iayork
  • For clarification, if I had a square image of, let's say, 9 pixels (3x3), I would assign "arbitrary" bases to each pixel, let's say line 1: [G A T], line 2: [T A C], and line 3: [A A A]. And I make an arbitrary rule stating that this 3-line code of bases is equivalent to this 9-pixel picture. I then install this code into bacteria using the CRISPR method and read it back. Simply put, is this what the scientists did? – PiratePi Jul 13 '17 at 16:09
  • 16
    Just to be clear to the OP, this isn't conceptually any different than encoding images in binary, except there are 4 possible states instead of just 2. Effectively, each base in DNA is 2 bits. – Bryan Krause Jul 13 '17 at 16:14
  • @PiratePi conceptually that's pretty much right. You describe arbitrary encoding for a full image, they did it using arbitrary (but consistent) encoding per pixel, but that's the only difference. – iayork Jul 13 '17 at 16:39
  • Just to add an explanation of one point that may not be clear (and might usefully be incorporated into the answer). GIF is a format for colour images that allows images of up to 256 red-green-blue colours (2^8). A colour table defines what colour corresponds to each of the 256 number values. The genetic code will only allow a maximum of 64 colours to be defined from a DNA sequence. These 64 colours can still be interpreted by software that can interpret GIF image encoding — the fact that the other 192 possibilities are not used is irrelevant. Likewise for 21, rather than 64. – David Jul 13 '17 at 17:39
  • 4
    Though there's nothing stopping them from using 4-base "codons" to get 256 colours. – canadianer Jul 13 '17 at 17:48
  • 1
    @canadianer Indeed; for this purpose there is nothing any more special about using 3-base codons than there is using 8-bit bytes. – Bryan Krause Jul 13 '17 at 19:45
  • Is there a reason that AAG doesn't map to a number? – Andrew Jul 13 '17 at 22:42
  • 6
    "The image was not in the DNA as such, only as an abstract representation that could be converted into an image from knowledge of the code" Right, which is what encoding means. The image absolutely was "in the DNA" ... and the subsequent faithful extraction proves it. – Lightness Races in Orbit Jul 13 '17 at 23:08
  • 2
    @AndrewPiliser This would be a great, separate question. AAG is the PAM used by E. coli which is necessary for protospacer acquisition, or at least greatly increases acquisition efficiency. – canadianer Jul 14 '17 at 02:59
  • What's exactly a protospacer? – Mockingbird Jul 14 '17 at 04:41
  • "Four nucleotides define each pixet, and the pixels of a given pixet are distributed across the image" Are these 4 nucleotides the 1st base of a triplet codon? – Mockingbird Jul 14 '17 at 04:59
  • The mentioned paper isn't freely accessible. – Mockingbird Jul 14 '17 at 05:45
  • 3
    @Konrad Rudolph they did both. "a rigid strategy, in which 4 pixel colours were each specified by a different base; and a flexible strategy, in which 21 possible pixel colours were specified by a degenerate nucleotide triplet table" – iayork Jul 14 '17 at 10:00
  • @Mockingbird The four-nucleotide approach was a different, simpler but less flexible, strategy from the triplet strategy. – iayork Jul 14 '17 at 10:01
  • @iayork Thanks for the clarification, turns out I misread the comment that [the comment that I criticised] was replying to. – Konrad Rudolph Jul 14 '17 at 10:22
  • 1
    I hate to pile on with more comments here, but I should correct my previous statement that there is "nothing stopping them from using 4-base codons". In fact, I see in the paper that they were already concerned about the cost of synthesizing all of these oligonucleotides. – canadianer Jul 14 '17 at 11:15
  • @LightnessRacesinOrbit I'd suppose if the image was encoded in the DNA, then the bacteria would be able to construct some protein which would look like that image. Alas, it wasn't, even remotely. Instead, it's just that the DNA was used as a medium for storage of image data, which is much less exciting. – Ruslan Jul 15 '17 at 16:53
17

Just to add what might have been missing from the beautiful answer by @iayork, I want to give a simpler picture of the encoding done in the E. coli DNA.

  • First for the rigid strategy in which 4 pixel colors were each specified by a different base, suppose we have a sequence:

    AAGCCCTGGTCAGCT

    Ignore the first AAG and start with C. Now, each base of DNA can represent a 2-digit binary number, and each number then corresponds to a color, like:

    C = 00

    T = 01

    A = 10

    G = 11

    With this strategy in mind, the sequence CCCT would give the pixet (or pixel set) 00000001, and so on as the sequence grows. This pixet would define the colors of four pixels in the image. Thus, each base corresponds to one pixel, and the base defines the color of that pixel in a 4-color image.

  • Now, let's turn to the flexible strategy. To begin with, see the table again:

    [Figure: flexible strategy table]

    Here we are using standard 3-base codons. From the predefined value for each color (1 to 21), we can find the color using the codon. For example, from the same sequence:

    AAGCCCTGGTCAGCT

    Ignore AAG again and start with CCC. From the table, CCC encodes a value of 1. Moving on, TGG encodes a value of 16, TCA encodes 10 and GCT encodes 7, and so on for longer sequences. So we now get an image with 4 pixels, i.e. 2 x 2, with the pixels having color codes 1, 16, 10 and 7. In this way, each pixel can take a color from the predefined values. On extracting this data, the image comes out as (from gizmodo):

image
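The two decoding schemes described in the list above can be sketched in Python. This is only an illustrative sketch, not the authors' code: the function names are made up, the base-to-bits mapping is the example mapping given above, and `PARTIAL_FLEXIBLE_TABLE` holds only the five codon values quoted in this answer (the full 21-color table is in the paper's figure and is not reproduced here).

```python
PAM = "AAG"  # leading motif that is skipped, not decoded

# Example rigid mapping from this answer: one base -> one 2-bit pixel value.
RIGID_BITS = {"C": "00", "T": "01", "A": "10", "G": "11"}

# ASSUMPTION: partial table, containing only the codon values quoted in this answer.
PARTIAL_FLEXIBLE_TABLE = {"CCC": 1, "TGG": 16, "TCA": 10, "GCT": 7, "TCC": 6}


def strip_pam(seq: str) -> str:
    """Drop the leading PAM; raise if the sequence does not start with it."""
    if not seq.startswith(PAM):
        raise ValueError("sequence is expected to start with the PAM")
    return seq[len(PAM):]


def decode_rigid(seq: str) -> list[int]:
    """Rigid strategy: every base is one pixel with a fixed 2-bit value."""
    return [int(RIGID_BITS[base], 2) for base in strip_pam(seq)]


def decode_flexible(seq: str, table=PARTIAL_FLEXIBLE_TABLE) -> list[int]:
    """Flexible strategy: every 3-base codon is one pixel, looked up in a table."""
    payload = strip_pam(seq)
    if len(payload) % 3 != 0:
        raise ValueError("payload length must be a multiple of 3")
    codons = [payload[i:i + 3] for i in range(0, len(payload), 3)]
    return [table[codon] for codon in codons]


example = "AAGCCCTGGTCAGCT"
print(decode_rigid(example))     # twelve pixel values, one per base
print(decode_flexible(example))  # four pixel values, one per codon
```

For the example sequence, the rigid decoder yields twelve 2-bit values beginning 0, 0, 0, 1 (the CCCT pixet), while the flexible decoder yields the four values 1, 16, 10, 7.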

The above dealt mostly with the single image of a hand. For the horse-riding GIF, the process is almost the same, except that we have to encode 5 images instead of one. Scientists encoded these 5 images in 5 different cells. After culturing them for some generations, they extracted the information for all the images (using standard bioinformatics tools) and compiled them to get the GIF back. The initial and final GIFs look like this (from wired.com):

GIF

What do these rigid and flexible mean?

In this technique, the terms rigid and flexible are more about the individual base than the codon. In the rigid strategy, the value of each base is fixed, i.e. rigid. For example, in any sequence, C will encode the value '00', whatever the next or previous base is. This means that in both CCCT and GGTC, C has its rigid value '00'. So, for a 4-color image, where each base rigidly corresponds to the color of a pixel, we get as many pixels as there are bases in the sequence.

On the other hand, in the flexible strategy, the individual bases do not have a fixed value, and the value of a pixel is defined by all three bases of the codon encoding it. For example, TCC encodes a value of 6 while CCC encodes 1. The value of an individual base is degenerate (or flexible), hence the name flexible strategy.

Thus, in a nutshell, the rigid strategy is more efficient, since one pixel is defined by one base (whereas in the flexible strategy, one pixel is defined by one codon), while the flexible strategy is better suited to images with more colors, since using several bases per pixel gives more color options (whereas the rigid strategy gives only 4 colors, one per base).
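The trade-off can be put in numbers with a small sketch. The function name is hypothetical; the only fact used is that each position can hold one of 4 bases, so n bases per pixel allow at most 4^n distinct values.

```python
def max_colors(bases_per_pixel: int) -> int:
    """Upper bound on distinct pixel values when each pixel uses this many bases."""
    return 4 ** bases_per_pixel


# 1 base per pixel is the rigid strategy, 3 bases per pixel is the triplet strategy.
for n in (1, 3, 4):
    print(f"{n} base(s) per pixel: up to {max_colors(n)} colors, "
          f"{2 * n} bits of DNA per pixel")
```

The flexible strategy described here uses only 21 of the 64 values a triplet could distinguish, so the sequence-level bound is not the limiting factor.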

Why are we ignoring AAG?

As @canadianer points out in their answer, AAG is a PAM i.e. Protospacer Adjacent Motif. According to Wikipedia:

Protospacer adjacent motif (PAM) is a 2-6 base pair DNA sequence immediately following the DNA sequence targeted by the Cas9 nuclease in the CRISPR bacterial adaptive immune system. PAM is a component of the invading virus or plasmid, but is not a component of the bacterial CRISPR locus.

In simple terms (avoiding technical details), the PAM is required for CRISPR to function, but is not a part of the stored sequence itself. Much like punctuation, it is necessary for the proper functioning of CRISPR, but it is not read for encoding/decoding purposes. For the CRISPR system of E. coli used here, in which the Cas1-Cas2 complex acquires the spacers, the sequence AAG serves as the PAM and is thus not used for encoding. The scientists also avoided using AAG within their encoded sequences so that there wouldn't be more than one recognition site for integration (ignore this point if you're unfamiliar with how CRISPR works).

Reference: Shipman, S., Nivala, J., Macklis, J. and Church, G. (2017). CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature. http://dx.doi.org/10.1038/nature23017

another 'Homo sapien'
  • 2
    Just a note: The AAG sequence is a PAM for a specific Cas protein. There are Cas proteins from different bacterial species and they have different PAMs. – WYSIWYG Jul 14 '17 at 08:27
  • Why doesn't CAS9 read AAG? – Mockingbird Jul 14 '17 at 09:59
  • 4
    Nice addition, but there is no Cas9 in BL21. In this paper, PAM recognition for protospacer acquisition is mediated solely by the heterologous Cas1-Cas2 complex. Internal AAG is avoided so that there isn't more than one recognition site for integration. – canadianer Jul 14 '17 at 10:03
  • You might also mention the benefits of a degenerate code that are discussed in the paper, especially avoiding repeats and internal PAMs. – canadianer Jul 14 '17 at 10:05
  • 1
    A digital picture has many pixels in different sections. Is there any way to locate the pixels of a specific part of the picture with this method? Or did the scientists designate different bacteria for different sections? – Mockingbird Jul 14 '17 at 10:06
  • @mockingbird AFAIK the only way is to count. Nope, scientists encoded one full image in one cell; only different images were incorporated in different cells. As for your first question, Cas9 does read AAG, but it's more of a signal, so we don't take the risk of using it as a pixet. See canadianer's first comment. – another 'Homo sapien' Jul 14 '17 at 10:09
  • @Mockingbird I think they just wholesale sequenced the entire CRISPR locus, which is really not overly interesting. To me, the neatest part of this research is how they used CRISPR to integrate the information into the genome. – canadianer Jul 14 '17 at 10:13
  • 1
    I don't understand what you mean by "wholesale sequenced the entire CRISPR locus". Do you mean that the entire CRISPR locus is encoded for one image? But an image has many pixels. How did they maintain the order? – Mockingbird Jul 14 '17 at 10:32
  • @Another Can you include a link to a paper regarding this phenomenon which is not behind paywall? – Mockingbird Jul 14 '17 at 10:49
  • 1
    @another'Homosapien' Yes, just finished ;) – canadianer Jul 14 '17 at 11:02
  • you write - Now, each base of DNA can represent a 2-digit binary number. Why 2-digit? why not 1 or 3 digit? – user1995 Jul 15 '17 at 08:18
  • @user1993 because there are only 4 bases, too many for 1 digit (2) and too few for 3 digits (8) – another 'Homo sapien' Jul 15 '17 at 08:58
  • But what is not clarified in the answer was the use of the word "GIF" - did the researchers actually encode the image in the CompuServe Graphics Interchange Format, or is "GIF" being used intentionally as a misnomer because it's a more familiar way of saying "animated image"? – oldmud0 Jul 16 '17 at 01:14
  • @oldmud0 I wouldn't consider it either. They just broke the GIF (i.e. a group of images) into individual images and encoded them. Later, they extracted the data for the individual images and merged it to get the GIF back. So, it's neither being encoded in the GIF format, nor is GIF a misnomer. It does seem perfectly fine to me :) – another 'Homo sapien' Jul 16 '17 at 04:45
  • @another'Homosapien' Then, it would be more professional and correct to refer to an animated image not as a GIF, but rather as an animated image, no? A GIF itself was not written to the DNA, nor is the format of the animated image important ("graphics interchange format") in the experiment. – oldmud0 Jul 16 '17 at 04:48
  • Frankly, yes it should be so. But people not so familiar with computers and formats often fail to understand the term 'animated image' since this is what they refer to by the term 'GIF' (ironic, I know, but I've seen this many times) – another 'Homo sapien' Jul 16 '17 at 04:52
  • 1
    @oldmud0 Without reading the paper again, I guess the correct description would be that they transcoded a GIF into their novel DNA code. – canadianer Jul 19 '17 at 06:47
6

Since a few people asked why the AAG triplet is avoided in the code, I thought I'd add this in addition to the other answers. The interesting part of this research is not necessarily the image encoding but rather how they utilized the CRISPR system to integrate the encoding DNA into the genome. It may be a surprise to some that the image is not encoded in one long string but rather, due to the nature of the type I CRISPR system of E. coli, in 33 base pair chunks called protospacers (of which 27 bases are used for the actual encoding, which gives 9 pixels per spacer). Thus the entire 30x30 pixel image required stable integration of 100 protospacers (though not necessarily in a single cell). These protospacers (oligonucleotides) were chemically synthesized and then introduced into cells by electroporation.
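The spacer count follows directly from the numbers in the paragraph above. A minimal sketch of that arithmetic, assuming the triplet (3 bases per pixel) strategy; the variable names are my own:

```python
import math

SPACER_LENGTH = 33      # bases per protospacer, as stated above
CODING_BASES = 27       # bases per protospacer that carry pixel data
BASES_PER_PIXEL = 3     # triplet strategy
WIDTH, HEIGHT = 30, 30  # image size in pixels

pixels_per_spacer = CODING_BASES // BASES_PER_PIXEL
spacers_needed = math.ceil(WIDTH * HEIGHT / pixels_per_spacer)
non_pixel_bases = SPACER_LENGTH - CODING_BASES  # remaining bases hold no pixel data

print(pixels_per_spacer, spacers_needed, non_pixel_bases)
```

This reproduces the 9 pixels per spacer and 100 protospacers quoted above, with 6 bases per protospacer left over for purposes other than pixel values.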

Integration of these protospacers into the genomic CRISPR locus utilized overexpression of heterologous Cas1 and Cas2 endonucleases. These proteins recognize exogenous DNA preferentially when it is flanked by a protospacer adjacent motif (PAM), which in the case of the CRISPR system in question is AAG. The complex recognizes the PAM and cleaves the exogenous DNA to form the 33 bp spacer, which is inserted into the genome. Simplistically, it could be pictured something like this:

[Figure: a protospacer flanked by an AAG PAM]

However, consider a situation where AAG is used to encode a pixel:

[Figure: a protospacer containing an internal AAG]

This creates an internal PAM that could lead to loss of information, depending on which PAM is recognized. Actually, the major benefit of having a degenerate code is that it makes it possible to avoid certain triplet combinations that lead to internal PAMs or sequence repeats (which are error prone in replication).
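A check for this problem is easy to sketch. The function below is hypothetical (it is not from the paper) and assumes the sequence is written with its intended PAM at position 0; it reports every further occurrence of AAG, including ones that straddle two codons.

```python
def find_internal_pams(seq: str, pam: str = "AAG") -> list[int]:
    """Return 0-based start positions of every PAM occurrence after position 0."""
    return [
        i
        for i in range(1, len(seq) - len(pam) + 1)
        if seq[i:i + len(pam)] == pam
    ]


print(find_internal_pams("AAGCCCTGGTCAGCT"))  # no internal PAM
print(find_internal_pams("AAGCCCAAGTCA"))     # AAG used as a codon
print(find_internal_pams("AAGCAAGTC"))        # AAG formed across codons CAA + GTC
```

The last case shows why simply banning AAG as a codon is not enough on its own: neighbouring codons can also create the motif, which is one place where having several alternative codons per color helps.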


References/Further Reading:

Amitai G, Sorek R. 2016. CRISPR-Cas adaptation: insights into the mechanism of action. Nat Rev Microbiol 14:67-76.

Shipman SL, Nivala J, Macklis JD, Church GM. 2017. CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature.

Wang J, Li J, Zhao H, Sheng G, Wang M, Yin M, Wang Y. 2015. Structural and mechanistic basis of PAM-dependent spacer acquisition in CRISPR-Cas systems. Cell 163:840-853.

PS: For anyone that cares, those images are not technically correct but, at the moment, I don't feel like changing them. In reality, the PAM is not part of the processed spacer.

canadianer
  • Good enough, +1! Yet I feel you should expand the second paragraph a bit :P – another 'Homo sapien' Jul 14 '17 at 11:06
  • @another'Homosapien' I tried to avoid too much mechanistic detail since I expect a lot of the people interested in this question are not extremely well versed in the intricacies of CRISPR-Cas (and neither am I, for that matter). I'm open to suggestions, though. – canadianer Jul 14 '17 at 11:13
  • 1
    Without a little jargon, how is anyone supposed to assess the credibility? ;) – canadianer Jul 14 '17 at 11:39