We live in an age of information and data, and more is being generated every day. It's estimated that there are about ten trillion gigabytes of digital data on the planet right now, and about 2.5 million gigabytes are added to the total daily. An exabyte is one billion gigabytes, and exabyte data centers are currently used to store most of the world's data. These centers can be as large as several football fields and cost as much as one billion dollars.
Researchers are exploring whether DNA could be used to store data instead. Mark Bathe, an MIT professor of biological engineering, suggested that, in theory, a coffee cup full of DNA could store all of the world's data.
"We need new solutions for storing these massive amounts of data that the world is accumulating, especially the archival data," said Bathe. "DNA is a thousandfold denser than even flash memory, and another property that's interesting is that once you make the DNA polymer, it doesn't consume any energy. You can write the DNA and then store it forever."
DNA has already been used to encode images and text, and now Bathe and his colleagues have developed a method for extracting a desired file from a mixture of DNA. They did so using silica particles that are only six micrometers in diameter and are labeled with DNA sequences that describe their contents. The scientists demonstrated that the method worked by retrieving individual images from a set of twenty images that had been stored as DNA sequences. In principle, this approach could scale to 10^20 files. The work has been reported in Nature Materials.
The computers we use encode text, files, and other data as combinations of 0s and 1s. DNA can be used in the same way, but instead of 0 and 1 it has four nucleotide bases: A, T, G, and C. DNA is also extremely stable, compacts easily, and tends to be easy to synthesize and sequence.
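As a rough illustration of how binary data can map onto four bases, here is a minimal sketch in which each base stands for two bits. The particular mapping (00 to A, 01 to C, 10 to G, 11 to T) and the function names are assumptions made for this example, not the encoding used in the study; practical schemes also add error correction and avoid sequences that are hard to synthesize.

```python
# Hypothetical 2-bit-per-base mapping, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Reverse the mapping: four bases become one byte."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = encode(b"cat")
    print(strand)          # CGATCGACCTCA
    print(decode(strand))  # b'cat'
```

With this mapping, every byte becomes four bases, so a three-letter word takes twelve bases.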
Unfortunately, generating DNA for this purpose would be expensive. Right now, it's estimated that it would cost one trillion dollars to write one million gigabytes (one petabyte). Bathe estimated that the cost would have to drop by about six orders of magnitude to be competitive with magnetic tape, a common medium for archival storage. But this price drop might happen within two decades or less, he suggested. Another problem is finding a way to sort through the data so the desired piece can be easily and quickly found.
Right now, the polymerase chain reaction (PCR) is used to find the right sequence: primers with a specific, known sequence amplify a larger target sequence in the DNA, in a kind of retrieval process. But there are problems with this method. For one thing, it uses up the DNA because it's an enzymatic reaction.
"You're kind of burning the haystack to find the needle, because all the other DNA is not getting amplified and you're basically throwing it away," Bathe said.
In this study, the researchers created silica particles, each labeled with short DNA sequences, or barcodes, that correspond to the longer sequence of the DNA file. Barcodes correspond to labels like 'cat' or 'plane,' and the desired image can be retrieved by adding primers that correspond to certain labels. Primers like 'wild' or 'orange' might go with 'cat,' for example. The primers are also fluorescent, so the researchers can easily identify the location of a match in the sample. This approach also allows the file to be extracted without harming the rest of the DNA.
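A software analogy may help make the label-based retrieval concrete. The sketch below treats each particle as a file tagged with a set of labels and pulls out only the files that carry every label in a query, leaving the rest of the collection untouched. The file names, label assignments, and the `select` function are hypothetical; the real selection is done physically with fluorescent primers, not with code.

```python
# Hypothetical collection: file name -> set of barcode labels.
capsules = {
    "img_01": {"cat", "wild"},
    "img_02": {"cat", "orange"},
    "img_03": {"plane"},
    "img_04": {"cat", "wild", "orange"},
}


def select(collection: dict[str, set[str]], query: set[str]) -> list[str]:
    """Return the names of files that carry every label in the query."""
    return sorted(name for name, labels in collection.items() if query <= labels)


if __name__ == "__main__":
    print(select(capsules, {"cat"}))            # ['img_01', 'img_02', 'img_04']
    print(select(capsules, {"cat", "orange"}))  # ['img_02', 'img_04']
    print(len(capsules))                        # 4, nothing was consumed
```

Unlike PCR-based retrieval, the lookup here reads the collection without altering it, which mirrors the point that the remaining DNA is not destroyed.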
"At the current state of our proof-of-concept, we're at the one kilobyte per second search rate. Our file system's search rate is determined by the data size per capsule, which is currently limited by the prohibitive cost to write even 100 megabytes worth of data on DNA, and the number of sorters we can use in parallel. If DNA synthesis becomes cheap enough, we would be able to maximize the data size we can store per file with our approach," said study co-author and MIT senior postdoc James Banal.
The barcodes were about 25 nucleotides long. If two barcodes are put on every file, 10^10 (ten billion) different files can be uniquely labeled, and with four labels, 10^20 (one hundred quintillion) files can be uniquely labeled.
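The two counts, 10^10 and 10^20, are what you would get if each label is drawn from a library of about 100,000 distinct barcodes. That library size is an assumption inferred from the arithmetic, not a figure stated above, and the sketch counts ordered label combinations for simplicity.

```python
# Assumed library size, inferred from the article's numbers (not stated in it).
LIBRARY_SIZE = 100_000
BARCODE_LENGTH = 25  # nucleotides, as given in the article


def max_labeled_files(library_size: int, labels_per_file: int) -> int:
    """Number of distinct label combinations, counting order."""
    return library_size ** labels_per_file


if __name__ == "__main__":
    print(max_labeled_files(LIBRARY_SIZE, 2))  # 10000000000 (10^10)
    print(max_labeled_files(LIBRARY_SIZE, 4))  # 100000000000000000000 (10^20)
    # All possible 25-base sequences, for comparison: 4^25, about 1.1 * 10^15.
    print(4 ** BARCODE_LENGTH)                 # 1125899906842624
```

Under this assumption, the library uses only a tiny fraction of the sequences that 25 nucleotides could spell out, which leaves room to choose barcodes that are easy to tell apart.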
Bathe suggested that this storage method might be ideal for data that needs to be kept for a long time but isn't accessed much.
"While it may be a while before DNA is viable as a data storage medium, there already exists a pressing need today for low-cost, massive storage solutions for preexisting DNA and RNA samples from COVID-19 testing, human genomic sequencing, and other areas of genomics," Bathe said.