Revolutionizing Genomic Data Storage: The Power of Blockchain

This interview was conducted in Spring 2023. It has been edited for brevity and clarity.

Laura Sophie Wegner:

Can you please briefly explain what blockchain technology is in the context of your work on storing and protecting genomic information?

Eric Ni:

The way I like to think about blockchain is that it is kind of like a shared document between a network of participants where every participant has the same documents. These documents are a blockchain, which is made up of data or transactions that are cryptographically linked together in the form of blocks in a way that, once you add any information, it can never be removed or altered. Since this network is constantly being synchronized, this means that the data on the blockchain and the blockchain itself are immutable and unchangeable. That is very useful because it also means that anyone involved in the blockchain can audit and check that the data on there is correct, meaning that it is signed correctly and that the links are cryptographically verified. So anyone can see that the information that they have is an indisputable record. This can be very useful in a healthcare context or a biomedical context. One example is in electronic health records, when you want to determine the ownership of a piece of data in a health record. Blockchain is a way to do that. Another example is supply chains, especially drug supply chains. It can be useful for indisputably tracking the access logs and audit logs of various drugs, and where they are.
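To illustrate the cryptographic linking described in this answer, here is a minimal sketch in Python. It is not SAMchain or any production blockchain: there is no network, no signatures, and no consensus, and the names `block_hash`, `append_block`, and `is_valid` are hypothetical. It only shows why changing an earlier block breaks every later link.

```python
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, prev_hash, data):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "prev_hash": prev_hash, "data": data},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, data):
    """Append a new block that is linked to the current last block."""
    index = len(chain)
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    block = {
        "index": index,
        "prev_hash": prev_hash,
        "data": data,
        "hash": block_hash(index, prev_hash, data),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Audit the chain: every stored hash and every link must check out."""
    expected_prev = GENESIS_PREV_HASH
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev_hash"] != expected_prev:
            return False
        recomputed = block_hash(block["index"], block["prev_hash"], block["data"])
        if block["hash"] != recomputed:
            return False
        expected_prev = block["hash"]
    return True
```

Any participant holding a copy of the chain can run `is_valid` on it. Editing the data of an old block changes its recomputed hash, so the audit fails unless every later block is rewritten as well, which a synchronized network would reject.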

LSW:

What aspect of blockchain development does your lab focus on?

EN:

The focus of our lab is on personal genomics, which means understanding people's genomes, their DNA, and how we can use that information to inform their healthcare. So a lot of what we do is mine the genome for information about how a person might respond to a type of drug or how they might be susceptible to a kind of disease. So to do that, we have to store their genome somewhere, and we also have to share that information somehow with other people. The problem with that is that, as biomedical research progresses, we're getting better and better at mining more and more information about the genome. And therefore, the privacy risk increases over time, so we have to be careful when sharing the genome since it is full of so much identifying information. We want a resource that allows us to share the genome securely, but also allows us to still be able to do research or analyses on it. One way we thought of doing that is using blockchain. We use a private blockchain network where a patient is the data owner and the sole person who has the ability to give access to other people. We envision a world where patients give access to clinicians and researchers, and those people will be able to access their patients’ genome which is uploaded on a blockchain and shared. Since it's a private blockchain network, only those people will be able to see it.

LSW:

Is it your lab’s goal to have your technology be implemented in hospital patient data storage systems?

EN:

Yeah, that's a good question because we have so far only focused on the development side. We launched this blockchain technology more as a proof of concept, demonstrating the fact that you can put a genome on a blockchain. So far we envision that people just set up small, private blockchains when they want to do any kind of analysis on their genome or they want to share it with someone in the future. It could be as simple as opening an app on your phone, reading the blockchain that way, and then giving someone access to it, such as a genome sequencer who would then append that data onto the blockchain, and then when the patient is at the hospital when they're interacting with researchers or clinicians, they can then give people access on an individual basis. So it's not so much about letting hospitals use this technology and more about letting individuals use it.

LSW:

What is the SAMchain you created and how does it differ from traditional genomic data storage methods?

EN:

First, I'd say that most genomic storage methods for blockchain store the data off-chain, such as on a cloud server or some private computer network, where it is protected with some kind of password. The blockchain itself actually just contains a link to that off-chain resource and manages access permissions, so it states who gets to have the password to the off-chain data. SAMchain is different in that it is the first to store the genome on the blockchain itself. The advantage is that the data gets somewhat stronger security guarantees, thanks to the immutability that comes from decentralization and cryptographic links. This also allows for easy and fast analysis of the genome because it is all right there on the blockchain. Of course that comes with a lot of its own problems, in terms of the amount of storage space it takes and how long it takes to create a blockchain in the first place. Those are the kind of issues that we are working on right now. I will say that those issues are big enough that our technology currently cannot be deployed into a product for people to use. But we are currently working on a new iteration of SAMchain where we can reduce the amount of space that genomic data takes up on the blockchain.
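The difference between the two designs described in this answer can be sketched in Python. This is an illustration under stated assumptions, not SAMchain's actual data model: the class names, the example URI, and the idea of storing a SHA-256 digest next to the link are all hypothetical additions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class OffChainPointer:
    """Common design: the chain holds only a link plus access permissions."""
    uri: str          # hypothetical location of the off-chain file
    sha256_hex: str   # digest of the file (an assumption added for this sketch)
    allowed_users: set = field(default_factory=set)

    def on_chain_bytes(self) -> int:
        # Only the link and the 32-byte digest occupy space on the chain.
        return len(self.uri.encode("utf-8")) + len(bytes.fromhex(self.sha256_hex))


@dataclass
class OnChainPayload:
    """On-chain design: the genomic data itself is written to the chain."""
    data: bytes
    allowed_users: set = field(default_factory=set)

    def on_chain_bytes(self) -> int:
        return len(self.data)


def make_pointer(data: bytes, uri: str, allowed_users=()) -> OffChainPointer:
    """Build the small on-chain record for data that lives elsewhere."""
    return OffChainPointer(uri, hashlib.sha256(data).hexdigest(), set(allowed_users))


def matches_pointer(pointer: OffChainPointer, fetched: bytes) -> bool:
    """Check that data fetched off-chain is what the on-chain record refers to."""
    return hashlib.sha256(fetched).hexdigest() == pointer.sha256_hex
```

The pointer keeps the on-chain footprint small regardless of genome size, but anyone analyzing the data must first fetch it from the off-chain store. The on-chain payload is immediately available to every permitted participant, at the price of the storage cost discussed later in the interview.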

LSW:

Can you explain how blockchain technology addresses concerns around data ownership, privacy, and security to ensure that genomic data remains secure and under the control of the individual?

EN:

I would say that blockchain is very much geared towards providing an indisputable record of information, thus providing security to protect people’s privacy. It has to ensure that the ownership of the data does not change. So, you know, NFTs, non-fungible tokens, basically have the same idea. You have the data stored somewhere else and the NFT itself is linked to that data. But more importantly, the NFT is defining the owner of that data and the ownership is very important. Blockchain is very similar in providing that indisputable ownership of the data, but also allowing that owner of the data to manage access, meaning that they have the permission to append to the data, who else gets to access their data, and who else gets to share their data on their behalf if they wish to do that. In general, I would say that blockchain helps with establishing ownership.
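The owner-managed permissions described in this answer (who may read, who may append, and who may share on the owner's behalf) can be sketched as follows. This is a plain in-memory Python illustration, not a blockchain or SAMchain's permission system; `Permission`, `GenomeRecord`, and the rule that a delegate may pass on only read access are assumptions made for the example.

```python
from enum import Flag, auto


class Permission(Flag):
    NONE = 0
    READ = auto()     # may access the data
    APPEND = auto()   # may add data, e.g. a sequencing provider
    SHARE = auto()    # may grant read access on the owner's behalf


class GenomeRecord:
    def __init__(self, owner: str):
        self.owner = owner    # ownership is fixed when the record is created
        self._grants = {}     # user name -> Permission

    def permissions_of(self, user: str) -> Permission:
        if user == self.owner:
            return Permission.READ | Permission.APPEND | Permission.SHARE
        return self._grants.get(user, Permission.NONE)

    def can(self, user: str, permission: Permission) -> bool:
        return permission in self.permissions_of(user)

    def grant(self, granter: str, user: str, permission: Permission) -> None:
        """The owner may grant anything; a SHARE delegate may grant only READ."""
        if granter != self.owner:
            is_delegate = self.can(granter, Permission.SHARE)
            if not is_delegate or permission != Permission.READ:
                raise PermissionError(f"{granter} may not grant {permission}")
        self._grants[user] = self.permissions_of(user) | permission

    def revoke(self, granter: str, user: str) -> None:
        """Only the owner may withdraw access."""
        if granter != self.owner:
            raise PermissionError(f"{granter} may not revoke access")
        self._grants.pop(user, None)
```

For example, a patient could grant `Permission.APPEND` to a sequencing provider and `Permission.READ` to a clinician, then later revoke either one. On a real blockchain each of these calls would be a signed transaction recorded on the chain, which is what makes the ownership and access history auditable.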

LSW:

With your technology protecting DNA data, do you think that blockchain technology should branch out to patient data storage in hospitals and doctor’s offices? If yes, what challenges and opportunities do you see in this future step?

EN:

We explored that as well before. We had another paper describing pharmacogenomic data, meaning data on how drugs and genomes interact. We are currently working on a paper about storing genomic training certificates or biometric biomedical training certificates on a blockchain, but really, you can put any type of data on the blockchain and manage it. For every type of data there are unique challenges. For example, biometric data, the type of data that you would get from biosensors such as those on an Apple Watch, can also be very large, and it is constantly being collected. There are unique choices to make with this data, such as how much of it you actually store on the blockchain, or whether you analyze that data on the blockchain. You also have to think about what format you put it in when you're putting it on the blockchain, and that very much comes down to how you want to access it. For example, your genome is split into different chromosomes and then each one refers to a sequencing read so you can see exactly where [they] are. So depending on your data, you have to format and store that data in a certain way. There are a lot of technical details.

LSW:

Ignoring some of the storage and efficiency challenges of blockchain technology, do you think that such a blockchain storage system should be integrated into systems that hospitals and doctors already use to store their patient data or would this be an additional program?

EN:

I think that it might be difficult to just transition to blockchain one-to-one, although you could certainly do that. But I think the amount of data is still a lot. As I said before, most people, like in the genome case, store the data off-chain and then they store links to the data on-chain. So I imagine you would probably do something like that with patient data. I think a lot of the problem is just that the adoption of blockchain technology can be kind of slow. And, you know, public trust in blockchain may not always be very high, especially with all the scandals in the cryptocurrency world going on. I think in any case, it is kind of easy to just take any system that exists and make a blockchain version of it. But in terms of making it efficient, it is more difficult. Nevertheless, it is important to note that there is a privacy challenge with blockchain as well: since data on a blockchain cannot be changed or altered, which is good for auditability, private information should not be put on a blockchain carelessly. For developers, it is an important consideration to either keep private information locked off-chain, or as we do in SAMchain, in a private blockchain network.

LSW:

Would you personally say that, if patients and healthcare workers trusted this technology, it would be better than the status quo?

EN:

Yeah, I would agree with that and I think that there has to be some kind of regulation around that as well. But otherwise, using blockchain, even just for its security guarantees, is always better than not having those security guarantees. Just having a storage location that is trustworthy and where the data cannot be changed is always going to be useful.

LSW:

How could regulations governing the use of genomic data impact the adoption and use of blockchain technology in personalized medicine and medical research, and what legal considerations should organizations keep in mind when implementing this technology?

EN:

To start explaining that, we should talk about the status quo of where those legal aspects are right now. Currently, there is no federal law governing genomic data, but there is HIPAA [Health Insurance Portability and Accountability Act], which concerns private or protected health information (PHI), such as your name or geographic location. If that kind of data is tied to your health data, then that data is protected. If your protected health information is removed from your data, then the data can be kind of freely shared. So in that sense, HIPAA is kind of a weak privacy law. By a federal standard, that means that genomes can be shared freely, which is not necessarily what you want, because that means that bad actors [might take advantage of it]. That is the reason why most genomic data is now kept behind protected access, meaning that access is only given to some people that have been audited or are selected in some way. That is an issue that our lab thinks about a lot. When you put genomic data behind protected access, it locks it off from bad actors, whoever is trying to use the data maliciously, but it also makes it harder for scientists to access that data sometimes. Or at least, there is a lot more of a bureaucratic process to get that data. As a result of that bureaucratic process, that data ends up being used a lot less. As scientists, our lab also thinks about how we can reduce that hurdle. How can we share this data effectively with other scientists, so that, you know, biomedical research can still progress but the data is still kept private and safe? Blockchain is actually one way to help with that, because we can control data access permissions more tightly and we can actually give ownership of the data to the patient. So in terms of legal considerations, currently, there really are none, especially for academic groups, simply because we are generally unrestricted.

I will say that several states are now implementing stricter privacy laws. The EU is a good example of this kind of law with GDPR [General Data Protection Regulation], a more comprehensive data privacy law. But also, I think California was the first state to do this, and Connecticut recently, very interestingly, passed a comprehensive data privacy law that covers not only the HIPAA-protected data, but also specifies that there is private data and then there is sensitive data. Sensitive data covers things like genomes or biometric data, which is data that does not inherently reveal your identity but that can be used to reveal your identity through various calculations. That is very interesting because it explicitly mentions that, under that law, genomes will be protected. Academic researchers do not really have to worry about it because we are not profiting off of people's data, but companies that are profiting off of their data, such as 23andMe or Ancestry, would have to consider what the people want when they are sharing their data, whereas right now, this is rather unrestricted.

LSW:

How do you anticipate that the law will adapt to protect the privacy of patients, while allowing healthcare practitioners to take advantage of blockchain technology for more precise patient care?

EN:

I think, in general, that the law has been very slow to follow the progression of science. But I think the recent data privacy laws passed by California and Connecticut are going in a good direction because they are defining what sensitive data is and they recognize that you do not need to have someone's name or their identity tied to a piece of data for that data to be revealing of their identity. I think it would be good if these laws more explicitly said exactly what types of data are sensitive data, and I think that there should be more regulations around such data to govern who can access it and how, so that it is more accessible to researchers but not to bad actors. For blockchain specifically, the fact that you can't remove data from it means that there should probably be some law regulating the addition of private data to a blockchain network. I can tell you that there are various privacy-preserving technologies that we develop on the academic side that can help with that. Our lab, for example, had one project where we were able to take someone's genome and split it into public and private portions, where the private portion contains all the data that is identifying and that can be used to tie it back to your identity, whereas the public portion can be shared freely with other people, and they won't be able to tell who that genome came from.

LSW:

Would the private portion of the data be stored on the blockchain and the public part would be stored off-chain?

EN:

Yes, I am really glad you brought that up because that is actually the kind of direction that we are thinking about for the next iteration of SAMchain. If you are putting the private portion on a blockchain, then it has to be a private blockchain. Then, the public portion can be put anywhere and shared publicly. The intention with this is that the public portion is still useful for most biomedical research, meaning that you are not taking so much private information out that you are hindering the analysis of this genome. You are just taking out enough such that you cannot tie it back to an individual's identity. What you said was spot on, we could think about putting the private portion on the blockchain instead of the entire [data] because it turns out that this private portion is a lot smaller and therefore takes up less storage space.

LSW:

Why is it that the main challenge of storing patient data on a blockchain seems to be the problem of storage space?

EN:

The reason why storage is such a problem is because of gas fees. A gas fee is something you pay every time you have to make a change on the blockchain that everyone has to verify and synchronize across the entire network. That takes up computational work on the blockchain’s behalf and therefore you have to pay some kind of price. That price is usually a cryptocurrency fee, as in the case of Ethereum. But otherwise, it is a verification fee for computational work. When you put large amounts of data onto a blockchain, that means that more data has to be verified and therefore leads to very high gas costs.

LSW:

So technically, it is possible to store more data on-chain, but it would be very expensive?

EN:

Yes, in fact, if you wanted to store one person’s entire genome on an Ethereum blockchain, this could become very costly. We estimate the cost per megabyte (MB) to be around $8000. Genomes can be anywhere from 5 to 500 gigabytes (GB) for raw reads, so this would mean millions to billions of dollars. This calculation is specific to Ethereum mainnet, so there are other blockchains that could be cheaper options. We therefore need to find a way to more efficiently store data on a blockchain, so that we can store as much data as possible with the smallest gas fees.
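The arithmetic behind "millions to billions of dollars" can be checked with a short Python sketch. The $8000 per MB estimate and the 5 to 500 GB range come from the answer above; the decimal conversion of 1 GB = 1000 MB and the function name `on_chain_cost_usd` are assumptions made for this example.

```python
COST_PER_MB_USD = 8000   # estimate quoted in the interview (Ethereum mainnet)
MB_PER_GB = 1000         # assumption: decimal units, 1 GB = 1000 MB


def on_chain_cost_usd(size_gb, cost_per_mb_usd=COST_PER_MB_USD):
    """Estimated cost in USD of writing size_gb gigabytes to the chain."""
    return size_gb * MB_PER_GB * cost_per_mb_usd


if __name__ == "__main__":
    for size_gb in (5, 500):
        print(f"{size_gb} GB of raw reads: about ${on_chain_cost_usd(size_gb):,}")
```

Under these assumptions, 5 GB of raw reads comes to about $40 million and 500 GB to about $4 billion, which matches the range given in the answer.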

Laura Wegner

Laura Wegner is a staff writer for the Harvard Undergraduate Law Review for Spring 2023.
