When is a Peak a Peak? (Claudio Cantù)

Episode 105

July 27, 2023

In this episode of the Epigenetics Podcast, we talked to Claudio Cantù from Linköping University about his work on peak blacklists, peak concordance and the burning question: what is a peak in CUT&RUN.

Our host Stefan Dillinger and guest Claudio Cantù dive into the topic of when we can be sure that a peak is a peak. To help with this, Claudio Cantù's group has been working on defining a set of suspicious peaks that can be used as a "peak blacklist" and can be subtracted to clean up CUT&RUN data sets. The lab also worked on a method called ICEBERG (Increased Capture of Enrichment By Exhaustive Replicate aGgregation) to help define peaks from a number of experimental replicates. By using this algorithm, the team is trying to discover the beta-catenin binding profile, not the tip of the beta-catenin binding iceberg, but the whole of the beta-catenin binding profile.

Read the transcript

Stefan Dillinger:
[0:23] Hello and welcome to this episode of the Epigenetics Podcast. Today I'm happy to welcome Claudio Cantù from Linköping University on the show. Please let me briefly introduce you to our audience. You obtained your PhD from the University of Milano and in 2011 you joined the developmental genetics lab of Professor Dr. Konrad Basler at the University of Zurich, Switzerland, for your postdoc. Then in February 2018 you moved from Switzerland to Sweden, where you started your own research group, and you are still there today as a senior associate professor. A question I'd like to ask every guest to start off our podcast is, how did you become interested in biology in the first place and then in pursuing a career in science?

Claudio Cantù:
[1:06] All right. I'm glad I received this question as the first one. Like, I would wonder how someone could not be interested in biology, in a way. And, well, I think that I was interested in biology as a child primarily, and this thing never went away. As a child, I was typically interested in dinosaurs and sharks, but then something happened when I was 19, 20, and I studied genetics at the university.
[1:43] We were studying the genetics experiments in Drosophila melanogaster from the beginning of the 20th century, from Thomas Hunt Morgan and his young students. And somehow in that moment, I think I switched my interest from the kind of green biology, the biology of animals, to molecular biology and genetics. And I think that what most fascinated me was that we could understand in detail things that are extremely complicated, but which we cannot see. We cannot even see them in principle, like DNA is of a size that is smaller than the wavelength of light, but yet we understand how it's structured and how it works. And this was just, I think, amazing. So until that moment, I think that I wanted to study marine biology or ecology, but then I switched immediately to an interest in genetics and molecular biology, and this never went away.

SD:
[2:55] Coming to your science, which centers around the mechanisms via which cells influence each other's genomes during development, which is very broad. This episode might be a little bit different than others because I hope we can have an open discussion and I want to get your opinion on a topic that might be interesting to many. There are two papers on bioRxiv, and this is also how I became aware of you and your work and wanted to invite you to this podcast, with the titles "The CUT&RUN Blacklist of Problematic Regions of the Genome" and "Exhaustive Identification of Genome-Wide Binding Events of Transcriptional Regulators with ICEBERG". So let's dive into it. But let's start with your work optimizing CUT&RUN and calling this CUT&RUN LoV-U. Can you talk about your motivation or the need to modify the original CUT&RUN approach and what advantage this method brought for you?

CC:
[3:47] Yeah, yeah, this is very interesting, for me at least. So when I was a postdoc at the University of Zurich in the lab of Konrad Basler, I was working on mechanisms downstream of Wnt signaling, so I wanted to look at which genes beta-catenin, sort of the fulcrum of Wnt signaling, regulates directly. And then I was doing ChIP-seq on beta-catenin. So ChIP-seq is like the gold standard technology to look at transcription factor binding genome-wide. The problem is that beta-catenin maybe has a very dynamic way of action, or it doesn't directly bind the DNA, or for some reason it was very difficult to detect with ChIP-seq. And then I spent basically two years of my life as a postdoc using information from other people who were successful with this, trying to adapt a ChIP-seq protocol and profile beta-catenin binding from in vivo tissues in the mouse. And then we were successful to some extent, in the sense that we got the datasets, which were good, quite noisy, but the reviewers accepted them because everyone knew that it was a very difficult target.

SD:
[5:08] So what were the factors that you optimized? Is it the fixative, the fixation time, or what did you change?

CC:
[5:18] Yeah, so there was a lot of choice of antibodies. And I think the tricks that made it work had to do with the amount of cells and with the cross-linking quality and quantity. And for this, I was inspired by a paper from the Hans Clevers group. And then what really worked was the combination of two chemical cross-linkers.
[5:52] One usually uses formaldehyde, which cross-links everything with everything, and we added another one, which was a specific protein-to-protein cross-linker. And this very likely made the experiment work, because beta-catenin doesn't bind DNA directly. It relies on other transcription factors to do so. So then we were cross-linking beta-catenin to those proteins that in turn bind DNA and which are likely cross-linked via formaldehyde. The other one is the trick that everyone knows in the ChIP-seq world, which is using as many cells as you can, which, however, was also a problem because I aimed at working in vivo using mouse embryos. And then what Dario Zimmerli, a PhD student at the time at the University of Zurich, and I were doing was a hecatomb. We had to sacrifice a lot of pregnant females and many hundreds of embryos to do a single experiment. And this also, of course, poses an ethical dilemma. It's like, is it worth it?
[7:05] So from there, I moved in 2018 and I set up my lab at the University of Linköping in Sweden. And I wanted to have some sort of a ChIP-seq-centered lab. The reason for this is that I think it's primarily an interest- and passion-driven choice, in the sense that I really like to see the gene regulation events at the interface between the regulators, which are the transcription factors and the chromatin regulators, and the DNA, the genome. And I also think that experiments like ChIP-seq, or CUT&RUN now, allow you to have the highest resolution of what we define as a molecular mechanism, which is always this vague concept that journals and editors and reviewers want from you.
[8:06] So basically, at that time, I wanted to establish a ChIP-seq-based project. And then I read the Henikoff paper in eLife about establishing CUT&RUN. And I thought about this sort of dual advantage that CUT&RUN has, which is that it doesn't require cross-linking, and you can have a ChIP-seq-like profile but with a very small number of cells. And I thought I needed to change immediately. It was not an easy choice for me because at that time I had already invested quite some money. For example, I paid $50,000 of my initial grant to buy a Covaris ultrasonicator, which is a great instrument required for ChIP. And the choice would make this instrument almost obsolete. But then I thought it might be a winning choice. And I think it has been a winning choice. The problem was that I had suffered to tune ChIP-seq for beta-catenin. And then when I was working with my initial coworkers, like, for example, Mattias Spernebrink, who is now at the Institute of Molecular Pathology in Vienna, we couldn't make CUT&RUN work on beta-catenin. So I thought that my former investment and my current investment of moving from ChIP-seq to CUT&RUN was the wrong one.

Overcoming Challenges in Cross Linking Methods

SD:
[9:31] When you do massive cross-linking, and this made it work, and then you move to a method that omits the cross-linking, I mean, it could be difficult, right?

CC:
Yeah, it could be difficult. Maybe with hindsight it was probably obvious. And I know what the reasons are why we insisted rather than simply going back to ChIP.
[9:58] Maybe a little bit of serendipity, maybe also the persistence of the people who were working with me at the time, Mattias and then Gianluca Zambanini, a fantastic student from Italy. So they were persistent and they found ways to make CUT&RUN work for beta-catenin. And then we established this slightly modified protocol. And what I liked about their work was that they really thought through the reasons why the experiment targeting beta-catenin was not working. And they tested those hypotheses until they found the one which presumably was the culprit, the cause of why targeting beta-catenin with CUT&RUN did not really work.

SD:
[10:51] What was the key factor that made it work ultimately? Because I think the problem with some of the transcription factors that are not working in CUT&RUN and CUT&Tag might be the reason why beta-catenin then worked in CUT&RUN.

CC:
[11:08] Yeah, yeah, exactly. So let's say that we had a few hypotheses of why beta-catenin could not work. My original idea was that beta-catenin, being relatively distant from DNA, as it doesn't bind it directly, would not allow the micrococcal nuclease that is used in CUT&RUN to cut the DNA.

SD:
[11:29] Because it's too far away?

CC:
[11:31] It could be, I thought it could be too far away. The pA-MNase doesn't reach DNA when it's directed to beta-catenin. For example, we tried to solve this by using primary and secondary antibodies. We thought we would extend the reach of the pA-MNase. And this never really gave us good results.

Utilizing nuclear extraction for positive results

CC:
[12:00] What ultimately could give results, of course, was using nuclear extraction, because beta-catenin goes into the nucleus, but there is quite some beta-catenin, of course, on the membrane and in the cytoplasm. And we thought this titrates away, binds and sequesters the antibody.
[12:16] But I think that the main leap forward to positive results was when we started doing in-situ denaturation of proteins. So at the end of the CUT&RUN procedure, when you are meant to elute your DNA fragments before you do library preparation for next-generation sequencing, we worked under the assumption that maybe beta-catenin is in a relatively big transcriptional complex, and the DNA might be cut left and right of the transcription factor or the transcriptional complex, but it's somehow trapped in the protein complex. So we started adding urea at this step in the reaction. And this, as a matter of fact, resulted in a huge yield of bigger DNA fragments that were released. And when we collected them, they would map to the binding profile of beta-catenin, which we knew already, of course, from, for example, the ChIP-seq experiments that we had done.
[13:24] So this allowed us to identify that, basically, if you do in-situ protein denaturation, it really looks like those big protein complexes trap some DNA, so some DNA remains stuck within those, and then you can release it by simply denaturing the complex. So this is now our favorite explanation of why the experiment worked.

SD:
[13:46] So it's not that the primary antibody and the MNase would not cut or would not go where the protein is binding, but that the complex after digestion would not release the DNA that you ultimately cut.

CC:
[14:00] Yeah, yeah, this is what we now think was the case, because it was urea that gave this leap in the release of DNA fragments.

SD:
[14:09] Interesting.

CC:
[14:09] And of course, we could also measure that the size of the fragment released with urea is typically larger. And this sort of gives an intuitive sense that this might be the case, that those are more likely remaining trapped into protein complexes and so on.

SD:
[14:30] So the ENCODE project and also others have compiled blacklists for ChIP-seq, which have been widely adopted. These lists contain regions of high and unstructured signal regardless of cell type or protein target, and you set out to do the same for CUT&RUN. So what was the initial situation you found yourself in when starting with this?

CC:
[14:53] Yeah, of course, every time we do a CUT&RUN experiment, we have been using the previous blacklist. So for ChIP? The one that is used for ChIP, because the reasoning is that the regions that are present in a blacklist have an obscure origin, in the sense that many are recognizable as regions of the genome that are difficult to map in principle, because they are highly repetitive and they can be subject, for example, to mapping artifacts.
[15:34] So you can have a few duplicated regions or repetitive regions and then many of the amplified fragments in your sequencing library map there maybe as an artifact and so on. And then what you see as a matter of fact is a signal that looks like a peak. But then you would see this signal also when you do a non-antibody experiment or an IgG control or a non-related antibody, targeting something that is not related to your biological hypothesis. You would see signal there. So then very wisely, I believe that someone had the idea to compile a list of regions that it's good to discard because they show signal, as again, the ChIP-seq blacklist might have included many types of artifacts that are mapping artifact that I suggested before, but they could also have been, for example, cross-linking artifacts. We cannot exclude that when you apply cross-linking, there are maybe big protein complexes somewhere in the genome that have nothing to do with the protein that you're trying to pull down, yet they get cross-linked, and for some chemical reason, they are preferentially, or they are enriched also in your pull down. So you find signal there.
[17:03] So there are many reasons why you get spurious signal, and they were very wise in compiling this ChIP-seq blacklist.

Discovering additional regions for the cut and run blacklist

CC:
[17:11] And we have used that list to subtract those regions from our datasets. But then we have many projects in the lab that essentially are CUT&RUN-centered. And we realized that many of those regions
[17:31] that are spurious signal and identified in the ChIP-seq blacklist are also appearing in the CUT&RUN experiments. So we thought it's wise to remove them. But then in particular Anna Nordin, who is now a PhD student in my lab, realized that there were other regions that appeared often in our CUT&RUN experiments but were not included in the ChIP-seq ENCODE blacklist. So, I have to admit, entirely on her initiative, she compiled a list and she proposed this project to me. Of course, I was very enthusiastic because of the importance of the project. I mean, we all need to rely on the good tool made by ENCODE, which, however, is not complete for CUT&RUN, because CUT&RUN identifies new regions that are also found when we use an IgG, a sort of pre-immune serum, like a mixture of unrelated antibodies, as a control. So we thought it's important to tell the community that those regions should raise suspicion at least.
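Editor's note: the blacklist subtraction described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the lab's pipeline: regions are BED-style half-open intervals, a peak is dropped if it overlaps a listed region by at least one base, and the function names and coordinates are hypothetical.

```python
from typing import List, Tuple

# (chromosome, start, end), half-open like a BED record
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def subtract_blacklist(peaks: List[Interval], blacklist: List[Interval]) -> List[Interval]:
    """Keep only the peaks that do not overlap any listed region."""
    return [p for p in peaks if not any(overlaps(p, b) for b in blacklist)]


# Hypothetical coordinates, for illustration only.
peaks = [("chr1", 100, 200), ("chr1", 5000, 5200), ("chr2", 100, 200)]
blacklist = [("chr1", 150, 400)]
kept = subtract_blacklist(peaks, blacklist)
print(kept)
```

The pairwise comparison is quadratic and only meant to show the logic; for genome-scale files an interval-aware tool would normally be used instead.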

SD:
[18:43] So to your knowledge, are you the first and the only one working on such a blacklist?

CC:
[18:48] No, I know that there are people who might have been looking at compiling other blacklists. In fact, the bioRxiv paper has now been cited by another group, who I think have similar goals. But I believe that we were the only ones trying to build a CUT&RUN-specific blacklist.

Building the CUT&RUN "Blacklist" for Research

CC:
[19:16] Yeah, to my knowledge, this is the case. Of course, we are trying to be in contact with the CUT&RUN community and I hope that they find our work useful. As for the CUT&RUN blacklist, we are now considering changing the name, also upon suggestion of the editor of the journal where the bioRxiv article is now in revision, from blacklist to suspect list. This is also based on a rationale which I very much like: for regions that display signal also when you use IgG antibodies, it's difficult to conclude that they are wrong signal. For example, some of those end up in promoters, and they appear even if we use no antibody or IgG, so there is a problem in those regions, but we should warn other users, because they might be the scientists trying to target the factor which maybe regulates those promoters.
[20:33] So people should also take all the blacklists with caution, because we can't exclude that any region of the genome is actually regulated by some rare factor. Yeah, this is the story of our work. And we still use it now. After having compiled this, we have done literally hundreds of CUT&RUN experiments, and we always use this tool. We have done repeated experiments with IgG controls, but we have compiled the blacklist, of course, also using datasets from others. We don't want to bias our blacklist with our own hands and experience.

SD:
[21:19] So this was... Because many questions I wrote down, you already kind of touched upon, but so how was the technical approach you took? So you took like data sets from your own lab and then also from other labs, and you kind of bioinformatically compiled them together and make kind of an average of all the peaks that might come up in all of them?

CC:
[21:38] Yeah, yeah, precisely. This was precisely what we have done, in a nutshell. So originally, like historically in the lab, we had done several IgG controls and we started compiling a blacklist using those. And we noticed that if we would take 20 IgG controls that we have done and look at all the peaks that are called, let's say, for example, in a majority of the experiments,

Lab blacklist and improving datasets

CC:
[22:10] then we would have a list of suspect regions that appeared in our own experiments, which, let's say, we called our lab blacklist. And when we applied this to other datasets, we could see that those datasets were improved, in the sense that most of the datasets also included those signals. Of course, we always rely on the fact of specificity. So we trust the peculiarity of our discovery when the signal is not found in the control. So in the very moment in which we noticed that, across the board of our experiments targeting our favorite transcription factors, the signal that we had identified in the blacklist that we compiled was present, we felt the urge to exclude it. You know, this might be genuinely good signal, but because it appears in the control, it's safer to exclude it. So we want to remove false positives. And on the false positive concept, there might be a twist if we are curious to talk about ICEBERG.
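Editor's note: the "peaks called in a majority of IgG controls" idea can be sketched as a simple vote. This is an illustrative sketch only, not the published procedure: the fixed 1,000 bp bins, the 50% cutoff, the function names, and the coordinates are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Set, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open
Bin = Tuple[str, int]            # (chromosome, bin index)


def bins_covered(peaks: List[Interval], bin_size: int) -> Set[Bin]:
    """Return every fixed-size genomic bin touched by at least one peak."""
    covered: Set[Bin] = set()
    for chrom, start, end in peaks:
        for idx in range(start // bin_size, (end - 1) // bin_size + 1):
            covered.add((chrom, idx))
    return covered


def suspect_bins(controls: List[List[Interval]],
                 bin_size: int = 1000,
                 min_fraction: float = 0.5) -> List[Bin]:
    """Bins with a peak in more than `min_fraction` of the control experiments."""
    votes: Counter = Counter()
    for peaks in controls:
        # A set is used so that each control votes at most once per bin.
        votes.update(bins_covered(peaks, bin_size))
    needed = min_fraction * len(controls)
    return sorted(b for b, n in votes.items() if n > needed)


# Three hypothetical IgG controls.
controls = [
    [("chr1", 1200, 1800), ("chr2", 0, 500)],
    [("chr1", 1500, 1900)],
    [("chr1", 1100, 1300), ("chr3", 4000, 4500)],
]
lab_suspect_list = suspect_bins(controls)
print(lab_suspect_list)
```

Only the chr1 bin recurs in most controls here; the peaks seen in a single control are not flagged.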

SD:
[23:26] Yeah, that's later down the road. I also wanted to get in, how does the cut-and-run blacklist compare to the ENCODE blacklist? So is there something that come up that is maybe interesting? Or is it just like a list of peaks that does not tell anything more interesting than just using it?

CC:
[23:48] Yeah, I would have to go into the wild world of speculation here, because there are peaks overlapping with the ENCODE blacklist, but not all of them. So the CUT&RUN blacklist doesn't identify some of the ENCODE blacklist regions. And the CUT&RUN blacklist is smaller. So it's a smaller set of peaks. And many are in regions where, obviously, when you look at the sequence, you're in a centromere, and there are a lot of repetitive regions, and there are a lot of peaks there that appear across our IgG experiments and across the control experiments of others, other published datasets from independent laboratories.

Limited ability to see through DNA sequencing and regulatory regions.

CC:
[24:48] And those sequences don't tell me much, but of course, I am a person with limited ability to see through DNA sequences and regulatory regions. Some of those regions, as I said, I believe we have, for example, one blacklist or suspect list peak in the GSK3 promoter, which is of course a very important gene. And then we get this peak across our experiments and we have to exclude it because it's present in the blacklist. However, we wish to urge caution, in the sense that someone is bound to identify a factor which regulates GSK3, which is a gene that must be regulated at the promoter by transcriptional regulators. So this is also why I favor the change of name from blacklist to suspect list, because I think that scientists have to use those lists cum grano salis, with a little bit of caution, and look at those instances individually in each of their experiments.

SD:
[26:01] Yeah, I mean, it would be interesting to see, if there is a regulation at the promoter, does this blacklist peak indeed change, right? I mean, if there is indeed a regulation but the peak still shows up the same afterwards, it might be dangerous for the interpretation of this specific experiment then.

CC:
[26:19] Yeah, of course, of course. And the other aspect, maybe, to follow up on your question about what we learn from those blacklists, is this: also upon a, I think, genuinely good suggestion from a reviewer, and this will be published in another version of the paper, we have looked at the transcription factor binding motifs of the regions that we identified in the blacklist. And there are quite a lot of things going on. So, for example, we identify CTCF motifs and then you can think, well, is it CTCF binding that produces spurious signal, and so on? It doesn't look like that is the case when we, for example, overlap the blacklist with our CTCF binding profile. And we have also done the same analysis of motif search in the ENCODE blacklist. And also there we identify quite a lot of transcription factor binding signatures. But I don't think that we can learn much from this.

Motifs are degenerate DNA sequences, limited relevance in analysis.

CC:
[27:33] For the reason that motifs are degenerate DNA sequences. So a motif could be a bad motif that does not really allow stringent binding of transcription factors. And also motifs are short DNA sequences. So the more you expand the pool of DNA sequence you are screening, the higher the chance of identifying some motifs by mere luck. So I don't think that you necessarily learn something from those. The ultimate test would be, is that region actually bound by the transcription factor with that motif? And then we need to do a ChIP-seq or a CUT&RUN.
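Editor's note: the point that short, degenerate motifs turn up "by mere luck" as the searched sequence grows can be made concrete with a back-of-the-envelope estimate. The sketch below assumes random DNA with equal base frequencies and independent positions, and it counts both strands (so palindromic motifs would be double counted); the motif shown is a hypothetical example, not one taken from the paper.

```python
from typing import List


def expected_chance_matches(motif: List[str], searched_bp: int,
                            both_strands: bool = True) -> float:
    """Expected number of motif matches in random DNA of length `searched_bp`.

    `motif` lists, for each position, the bases that are accepted there,
    e.g. "AT" for a position that tolerates either A or T.
    """
    p_match = 1.0
    for allowed in motif:
        p_match *= len(set(allowed)) / 4.0  # chance a random base is accepted
    start_positions = max(searched_bp - len(motif) + 1, 0)
    strands = 2 if both_strands else 1
    return start_positions * strands * p_match


# Hypothetical 6-position motif with one degenerate position.
motif = ["C", "T", "T", "T", "G", "AT"]
small_pool = expected_chance_matches(motif, 2_053)
large_pool = expected_chance_matches(motif, 1_000_000)
print(small_pool, large_pool)
```

Under these assumptions the expected count grows linearly with the amount of sequence screened, which is why a motif hit on its own says little about actual binding.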

SD:
[28:25] So, to conclude this section, can you say something, and if you can't, that's perfectly fine, about when this will be peer-reviewed and published? Because now it's on bioRxiv, but when this is listened to, like, a year from now, then yeah, it may be out somewhere.

CC:
[28:43] Yeah, yeah, it will be out. So it was, I believe I can say this now, accepted as an article and it will be published, I think, in a time frame of a month or something like this. And we've included all the peer review, and mostly it was, I would have to say, great suggestions. You know, when I grew up as a postdoc, I started being a little bit against the peer review system. They were just simply denying me the chance to quickly publish my papers, which are correct. And then of course, in many instances, I had to revise my opinion on it. And this is one of those cases, in the sense that I grew up as a molecular biologist and a developmental biologist. And for those projects, which are almost entirely computational, Anna, who is the lead author on this and working in my lab, and I really wanted expert opinion. I mean, just please help us to understand if we're doing something wrong. And the very fact that, let's say, at least two reviewers out of three took the project with enthusiasm made us also happy and more anxious in going ahead with aggregation.

SD:
[30:12] So next to this blacklist project, another question that might come up, or has come up, in one or the other experiment is, how many peaks are overlapping with ChIP-seq experiments, and how is the overlap of peaks in biological replicates in CUT&RUN? To solve this, you developed ICEBERG, and this stands for Increased Capture of Enrichment By Exhaustive Replicate aGgregation, an experimental and computational, again, procedure that utilizes numerous CUT&RUN replicates to discover the entire set of binding events and consistently exclude false positives. So maybe to start off, a really easy question: when is a peak really a peak?

CC:
[30:53] Yeah, this is a million-dollar question whose answer, in my opinion, is that no one really knows. And the reason why I say this is because a peak is the identification of signal at specific genomic coordinates.

Identification of Signal in Genomic Coordinates

CC:
[31:21] Now, as it turns out, for example, when you do ChIP, you are purifying cross-linked genomes, and then you hope that via immunoprecipitation you are enriching for those DNA regions that are bound by your transcription factor. Enrichment precisely means that there will be a ratio between the signal in that position and the left and right adjacent sites.

SD:
[31:54] The most famous signal-to-noise ratio, right?

CC:
[31:58] Exactly. This is the signal-to-noise ratio, which needs to be higher than one, in the sense that if the signal-to-noise ratio is one, in that position your signal is equivalent to background, so you cannot draw any conclusion. So is the signal-to-noise ratio 1.1, 1.2, 1.8, 2.5? Where is the number where we trust it as a signal, as long as it is reproducible? This is a genuinely difficult question that pertains to signal detection theory, of which I cannot say more in terms of math, as I only have an intuitive understanding of it. It's a problem of detecting anything. Even if you are piloting a plane and you have a radar and you spot a signal, this might be another fighter jet trying to bombard us, or maybe it's a flock of birds. And then you need to distinguish these two signals. And this is a genuine problem in all fields of human enterprise where we need to detect something that is not something else.

SD:
[33:10] I mean, it's a problem of measurement, right?

CC:
[33:13] It's a problem of measurement. Exactly. So how people deal with this in ChIP-seq, and I think it's still the best way, is in two ways. One, statistics. We use statistics to impose thresholds. And then you can imagine this as a line just above the signal across the entire genome in a ChIP-seq or CUT&RUN experiment. Signal that goes above, we call this a peak. Signal that stays below, we call this not a peak. That's a great method. The other method is replication. We do the experiment again and again, and then we say the peak in that position happened again, so I trust it. And of course, those are the two methods on which everyone is relying. And I have been relying on them, and people should be relying on them also now, I think. On the other hand, there are problems.
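Editor's note: the two safeguards just described, a threshold on enrichment and replication, can be illustrated with a toy example. This is a deliberately simplified sketch: real peak callers use statistical models rather than a fixed ratio, and the binned signal values, the background level, and the cutoff of 2.0 are all made up for illustration.

```python
from typing import List, Set


def enriched_bins(signal: List[float], background: float, min_ratio: float) -> Set[int]:
    """Indices of bins whose signal-to-background ratio exceeds the threshold."""
    return {i for i, s in enumerate(signal) if s / background > min_ratio}


def replicated_bins(replicates: List[List[float]], background: float,
                    min_ratio: float) -> Set[int]:
    """Bins that pass the threshold in every replicate (strict intersection)."""
    calls = [enriched_bins(r, background, min_ratio) for r in replicates]
    return set.intersection(*calls) if calls else set()


# Hypothetical binned signal from two replicates of the same experiment.
rep1 = [1.0, 1.1, 5.0, 1.9, 1.0]
rep2 = [0.9, 4.2, 4.8, 2.1, 1.2]

calls_rep2 = enriched_bins(rep2, background=1.0, min_ratio=2.0)
trusted = replicated_bins([rep1, rep2], background=1.0, min_ratio=2.0)
print(calls_rep2, trusted)
```

Bin 3 shows the borderline case: its ratio is 2.1 in one replicate and 1.9 in the other, so a strict threshold combined with a strict intersection discards it even though it is enriched in both.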
[34:13] Like if you become somehow, in inverted commas, an expert, I think, in ChIP-seq or CUT&RUN, in the sense that you are used to walking through the genome and scrolling the genome to see this is a peak, this is not a peak, you sometimes recognize that this is genuinely a peak because of its shape. It really looks like a ChIP-seq peak, where you know there is a peak summit and this is the obvious result of the biological procedure. When a transcription factor is binding in the position that is close to the peak summit, then this looks like a genuinely good peak. Other peaks are obviously not good because, for example, they are big squares, like tall squares, and they are obviously the result of PCR amplification of this library species of DNA.

The Challenge of Replicability in DNA Research

CC:
[35:02] But then often what you see is that, come on, that's really a great peak. It's called across replicates, but it's always a little bit below the threshold. So you don't call it.
[35:20] This is a common problem that everyone in the field, I think, knows, and you have to exclude it. But we are okay with excluding it because of what we also write in the paper. So scientists don't like being wrong. So if you have a positive signal, but you are not entirely sure, you just exclude it. You are okay to have false negatives. So that's a negative, but it could actually be a positive, but it's okay to call it a negative. What you don't want is a false positive, because a positive means I go around on the street with my paper, but then I turn out to be wrong. So we don't like that. And there is this sort of unwritten rule or common wisdom across scientists that we are okay with false negatives, but not okay with false positives. But then we started thinking that if I'm a patient, and I have a pain here in my stomach, and I go to a doctor, I would rather pay the cost of a false positive, in the sense that if I have a disease, but the doctor doesn't diagnose it, this is a false negative.
[36:42] Then I end up having a bad disease that is not diagnosed and I don't know it until it's too late. While if I am okay to risk a false positive, so increase the detection rate of the tool, then I will pay the cost of a second opinion or a second analysis to make sure. So I would rather rely on the replication than have my disease go undetected. So we were wondering whether this strict preference for false negatives over false positives is an idiosyncrasy of molecular biologists. I mean, we work in this field where this is the case, but is this the right attitude in all fields of human knowledge?
[37:32] Of course, we don't know the answer to this, but we sort of provocatively say, shall we try to be daring a little bit and maybe welcome some more signal and find another way to detect whether the signal is genuinely good or not? In other words, we are okay to exclude peaks on which we are not sure, but we know that we identify what- You worked hard enough to say that it's not a peak because you did some other controls. You did other controls, but sometimes you are left honestly with the doubt and you have big data sets that you need to exclude because they are within the set of unreplicability. If you imagine a simple Venn diagram with the two sets, the intersection set is what you trust, the two non-intersecting sets is what you say, I don't trust those, but those sets are often big. And this made us think that the intersection set, which is the, our discovery that we trust might represent the tip of the iceberg of the biologically relevant events.
[38:50] And then we thought, well, I want to discover the beta-catenin binding profile, not the tip of the iceberg of beta-catenin binding, but I want to discover all the binding profiles of beta-catenin. How do we do that? And then we thought that we should try to see what we say we can say from these two big non-overlapping sets. And then we improved replicability and computation by designing this iceberg, which was essentially designed by, Again, Anna Nordén, the person who compiled the blacklist, and also Pierre Francesco Bagella is a senior scientist in the lab. And Gianluca Zambanini, who is the original author of the Catalan LoveU protocol. By the way, LoveU was a real acronym, stands for low volume and urea. It's not only if and when.

SD:
[39:39] So the term you coined here was peak concordance. What exactly does it mean? And how does this compare between, again, CUT&RUN and ChIP-seq?

CC:
[39:53] Yeah, I'm actually happy you asked that, because while writing the article, which we wrote entirely as a team effort, we were of course writing "overlap" or "reproducible peaks".
[40:15] But from an almost philosophical point of view, defining what is reproducible is precisely the problem we were trying to address. So we thought of using a new term to address the overlap.
[40:32] And we agreed on concordance, which is somehow a neutral term that can define the outcome of our technology that spots events more than once across different replicates. And the different replicates are concordant to the extent to which they reproduce those events. So we thought that concordance can become a clearer way to define this identification of peaks that are reproduced in those experimental instances, which doesn't necessarily imply reproducibility, because reproducibility for us is often a synonym of reality, a synonym of what our real discovery is. Concordance, on the other hand, could also be due to technical issues: if we don't subtract the blacklist or suspect list, those peaks will be concordant, but not because they represent biological events. They are the result of technical artifacts, most likely.

SD:
[41:49] Yeah, so how did you approach that? I mean, there is this one issue that you have a high peak concordance in CUT&RUN, and a high peak concordance between CUT&RUN and, for example, ChIP-seq, right? So how is this reflected? How did you see that?

CC:
[42:08] Yeah, so it's difficult for me to compare ChIP-seq and CUT&RUN, in the sense that we didn't do this analysis thoroughly.
[42:20] What I can say is that historically I was doing ChIP-seq, and the maximum number of replicates I have done for a study of a transcription factor is three. And I had always noticed that the overlap across replicates is relatively low. By low I mean that, with factors such as beta-catenin, my favorite protein for a number of years, it can be 10 or 20%. So you do one experiment and you get 5,000 peaks. And then another day you do the same experiment again and you get 5,000 peaks. But of these 5,000 on the left and 5,000 on the right, only a thousand are in common. The reason why this happens has never been clear to me. But then, of course, I was bound to just trust the overlap. And what I would typically choose to do is describe the figure and say: this is the result, 5,000 and 5,000, but we trust the high-confidence thousand peaks that overlap. When we were doing CUT&RUN, the problem was similar, and quantitatively on a similar scale.
[43:39] So we had 5,000 peaks in one CUT&RUN replicate, 5,000 peaks in the second replicate, and only 1,000, more or less, would overlap. I'm talking again about beta-catenin or other difficult targets, because when you target, and this is something that we show in the Iceberg study, chromatin marks like post-translational modifications of histones, or other very stable transcription factors, the overlap is much higher. It's never 100%, but it's much higher. But the overlap with CUT&RUN when targeting difficult targets like beta-catenin is also typically low. What I don't know, and what I meant at the beginning of my answer, concerns the non-overlapping targets in ChIP-seq: I don't know how much those overlap with the non-overlapping targets of CUT&RUN. This is what we didn't look at thoroughly. And the reason why we didn't look at this thoroughly is that there is no Iceberg equivalent, to my knowledge, for ChIP-seq, which is what we would need to compare the two technologies.
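
As a rough illustration of the overlap arithmetic described in this answer (two replicates with equally many peaks, of which only about a fifth are shared), here is a minimal Python sketch. The peak coordinates are invented toy data, the function names are hypothetical, and "overlap by at least one base" is an assumed criterion, not necessarily the one used in the study.

```python
from typing import List, Tuple

# A peak is (chromosome, start, end) with half-open coordinates.
Peak = Tuple[str, int, int]


def overlaps(a: Peak, b: Peak) -> bool:
    """True when two peaks share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def concordant_peaks(rep1: List[Peak], rep2: List[Peak]) -> List[Peak]:
    """Peaks of rep1 that overlap at least one peak of rep2."""
    return [p for p in rep1 if any(overlaps(p, q) for q in rep2)]


def concordance_fraction(rep1: List[Peak], rep2: List[Peak]) -> float:
    """Fraction of rep1 peaks that are also found in rep2."""
    if not rep1:
        return 0.0
    return len(concordant_peaks(rep1, rep2)) / len(rep1)


# Toy data: five peaks per replicate, only one shared (mimicking ~20% overlap).
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150),
        ("chr2", 900, 1000), ("chr3", 10, 60)]
rep2 = [("chr1", 150, 250), ("chr1", 700, 800), ("chr2", 300, 400),
        ("chr2", 2000, 2100), ("chr3", 500, 560)]

shared = concordant_peaks(rep1, rep2)
fraction = concordance_fraction(rep1, rep2)
print(shared)    # [('chr1', 100, 200)]
print(fraction)  # 0.2
```

The quadratic scan is fine for a toy example; for real peak files with thousands of intervals, a sorted sweep or an interval-tree library would be the more usual choice.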

Determining the number of replicates for accurate results

SD:
[44:48] So how many replicates would you then need to confidently say, well, this is a peak for, let's not say maybe histone modifications, because you just said that those might be better or more stable or more often called in different replicates. But if you look at a transcription factor, let's say beta-catenin, how many replicates, be it technical or biological, how much is enough?

CC:
[45:13] So we don't have a definitive answer on this, but what we could clearly show, by doing a reiterative replication of the experiment, and this was shown mathematically by Pierre-Francesco Pagella in the lab, is that the discovery curve at some point plateaus. In the case of beta-catenin, and we have evidence that this is the case for at least one other factor, when you approach 20 replicates and go beyond 20, the discovery rate for each additional replicate becomes quite low. Pierre-Francesco was calculating the derivative of the curve, which approaches zero at the end. And this shows that you have plateaued your discovery, which, in our metaphor, means we have seen the entire iceberg.
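
To make the plateau idea concrete, here is a small Python sketch of a toy model, not the analysis from the paper. It assumes each binding site is detected independently in every replicate with a fixed, site-specific probability (an assumption made only for illustration), computes the expected number of distinct sites seen after n replicates, and takes the discrete derivative of that curve. The mixture of probabilities is invented.

```python
from typing import List


def expected_discovery_curve(detect_probs: List[float], n_replicates: int) -> List[float]:
    """Expected number of distinct sites seen at least once after 1..n replicates.

    A site with per-replicate detection probability p is missed by all n
    replicates with probability (1 - p) ** n.
    """
    return [sum(1.0 - (1.0 - p) ** n for p in detect_probs)
            for n in range(1, n_replicates + 1)]


def marginal_gains(curve: List[float]) -> List[float]:
    """Discrete derivative: expected new sites added by each extra replicate."""
    return [curve[0]] + [b - a for a, b in zip(curve, curve[1:])]


# Hypothetical mixture: 100 robust sites, 300 intermediate ones, 600 rarely detected ones.
probs = [0.9] * 100 + [0.3] * 300 + [0.1] * 600

curve = expected_discovery_curve(probs, 25)
gains = marginal_gains(curve)

for n in (1, 2, 3, 10, 20, 25):
    print(f"replicates={n:2d}  discovered={curve[n - 1]:7.1f}  gain={gains[n - 1]:6.1f}")
```

In this toy model the gain per additional replicate shrinks steadily, which is the behavior described above as the derivative of the discovery curve approaching zero.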
[46:12] Only having this tool allowed us to benchmark the full discovery rate, the iceberg, this full discovery set, against what you would discover if you were to do only one, two, or three replicates. And in the article we provide an entire figure for this analysis, because, first of all, CUT&RUN is a relatively expensive and demanding experiment. I think that it's reasonable to expect that people would want to replicate their experiment twice or three times.
[46:56] And then what do you get with that? You get a part of the iceberg, not the entire iceberg. I think that's entirely okay. You discover the most important regulation events, or maybe the most common ones across the cell population that you are investigating, and so on. And for example, what we have seen is that when you target beta-catenin, if you do one replicate or two replicates, your discovery rate could be relatively good, depending also on the peak calling strategy that you design. So we looked at this in terms of recall and precision, where recall is the fraction of the Iceberg peaks that you actually identify, so the number of real positives you find over the total number of positives.
[47:55] And the precision is how precise your dataset is, which means: across the whole number of peaks that you identified, how many are real positives. This really allows you to draw the line between false positives and false negatives. So we have seen that, of course, there is always a trade-off between precision and recall. And regardless of the statistical test that you use, you favor one or the other. But, you know, if you look at the figure in the paper, you can see that even if you do one experiment... is it figure two? It is figure four, the one I refer to. And this is a very simple chart. Admittedly, we were jokingly inspired by the presentation of the first iPhone by Steve Jobs, when he made the chart about the existing smartphones, with the iPhone right in the top right part of the chart. And the same is true for the Iceberg: the Iceberg is the top right part of the chart, with perfect precision and perfect recall.
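
The two measures defined above can be written down in a few lines of Python. This is only a sketch under the assumption used in the conversation, namely that the aggregated Iceberg peak set is treated as the reference ("real positives"). The peak identifiers are made up.

```python
from typing import Set, Tuple


def precision_recall(called: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a called peak set against a reference set.

    precision = true positives / all called peaks
    recall    = true positives / all reference peaks
    """
    true_positives = len(called & reference)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference: ten peaks from the aggregated data set.
reference = {f"peak_{i}" for i in range(1, 11)}

# Hypothetical single replicate: four reference peaks plus one spurious call.
single_replicate = {"peak_1", "peak_2", "peak_3", "peak_4", "artifact_a"}

precision, recall = precision_recall(single_replicate, reference)
print(precision, recall)  # 0.8 0.4
```

By construction the reference set scored against itself gives precision 1.0 and recall 1.0, which corresponds to the top right corner of the chart mentioned above.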

SD:
[49:18] I'm so sorry to interrupt you. But could you just give a short summary of Iceberg, because we didn't touch on how it works?

CC:
[49:26] Yeah, yeah, you're right. So basically, Iceberg is a combination of experimentation and computation, where we did reiterative replicates, for example 25 replicates of CUT&RUN for beta-catenin, and then we...

Building a Big Individual Replicate for Data Integration

CC:
[49:48] For each replicate, we selected reads at random and put them all in one bag, so as to build one big individual replicate, essentially, that includes information from all the experiments.
[50:10] And the rationale behind this was that the spurious signal, which could be due to random effects of mapping or purification of specific DNA fragments, would be averaged out if you sum the signal coming from a large number of experiments, such that your background becomes a flatter line across the genome. The small signal, on the other hand, even if it typically falls below the threshold of detectability in individual replicates, might, if it's real, be summed and accumulated such that it goes above the threshold of typical peak calling statistics, and we might identify it. And this is actually what happens. We also try to explain the Iceberg logic in a drawing. And I think that this is really the case: the background became a tamer background, in the sense that we can really understand it better because it's flatter, while the real signal appears because it is the result of summing up all the small signals present in some, but not all, individual replicates.
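
The aggregation logic described here can be illustrated with a toy Python sketch. Everything in it is an assumption for illustration: read counts are invented, the genome is reduced to 20 bins, pooling is done by simply summing all reads per bin (rather than subsampling reads from each replicate), and a Poisson tail test against the median bin count stands in for a real peak caller. It is not the published ICEBERG pipeline.

```python
import math
from statistics import median
from typing import List


def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def call_bins(counts: List[int], alpha: float = 0.01) -> List[int]:
    """Indices of bins whose count is unlikely under a Poisson background.

    The background rate is estimated as the median bin count (toy choice).
    """
    lam = median(counts)
    return [i for i, c in enumerate(counts) if poisson_tail(c, lam) < alpha]


N_BINS, TRUE_BIN, N_REPLICATES = 20, 4, 6

# Each toy replicate: flat background of 2 reads per bin, a weak real signal
# of 5 reads at TRUE_BIN, and one spurious spike of 6 reads at a bin that
# differs from replicate to replicate.
replicates = []
for r in range(N_REPLICATES):
    counts = [2] * N_BINS
    counts[TRUE_BIN] = 5
    counts[10 + r] = 6
    replicates.append(counts)

per_replicate_calls = [call_bins(counts) for counts in replicates]
pooled = [sum(column) for column in zip(*replicates)]
pooled_calls = call_bins(pooled)

print(per_replicate_calls)  # nothing passes the threshold in any single replicate
print(pooled_calls)         # only the weak but consistent site is called: [4]
```

In this toy setting the spurious spikes are diluted by pooling, because each one occurs in a single replicate, while the weak signal that recurs at the same position accumulates and crosses the significance threshold.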
[51:37] And then, basically, this allowed us to discover several peaks that were called in maybe a very small number of replicates. For example, we do 25 replicates. If a peak is present in 25 out of 25, that's an easy choice. We like it. If it's present in 20 out of 25, we kind of like it. That's four-fifths of our experiments displaying that signal. This is quite a high chance.
[52:18] But then if you start looking at signal present in 10 out of 25 experiments, do you trust this? This is a minority of experiments, but still, CUT&RUN is a highly controlled experiment. You can detect that each experiment worked. And then this signal is present 10 times. 10 times is a high number. Typically, no one does CUT&RUN 10 times. And so we thought that we should trust those, and we should also trust the signal that maybe appeared in an even smaller number of replicates. And we started thinking that maybe the frequency of signal detection across replicates might simply underlie the probability of that factor being identified there. And I think that this is an important concept, because it's something that people have been saying in the ChIP-seq field for 20 years now, which is that the peak is bigger when...
[53:23] ...the binding event occurs in a big portion of the population on which you are doing the experiment. Yeah, for example, it's possible that cells are heterogeneous. If a transcription factor is binding on the promoter of gene X in only 10% of the cells that you are purifying, then the peak will be smaller. People have had this feeling for 20 years. And I think that with Iceberg we capture this probability, which could be reflective either of the fraction of cells in which the regulation event occurs, or of the time that the transcription factor or regulator spends binding there. And of course, we don't know which of these two possibilities it is, but I think, and it's only my opinion, that it's a beautiful hypothesis that screams for future investigation.
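
The "frequency of detection across replicates" idea can be sketched in Python as follows. The site names and the 25/20/10 detection pattern mirror the numbers mentioned in the answer; the function name is hypothetical and the data are invented.

```python
from collections import Counter
from typing import Dict, Iterable, List


def detection_frequency(replicate_calls: List[Iterable[str]]) -> Dict[str, float]:
    """Fraction of replicates in which each site was called at least once."""
    n = len(replicate_calls)
    counts = Counter(site for calls in replicate_calls for site in set(calls))
    return {site: count / n for site, count in counts.items()}


# Toy data: 25 replicates; site_A is called in all of them, site_B in 20, site_C in 10.
replicate_calls = []
for i in range(25):
    calls = {"site_A"}
    if i < 20:
        calls.add("site_B")
    if i < 10:
        calls.add("site_C")
    replicate_calls.append(calls)

frequency = detection_frequency(replicate_calls)
print(frequency["site_A"], frequency["site_B"], frequency["site_C"])  # 1.0 0.8 0.4
```

Whether such a frequency reflects the fraction of cells carrying the binding event or the time the factor spends at the site cannot be decided from these counts alone, as noted above.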

SD:
[54:23] I mean, looking at the first point you just raised, that it's only single cells that might have this factor bound at a given time in this experiment, a single-cell experiment would obviously be the right experiment to do then, right?

CC:
[54:41] Yes, of course. The single-cell approaches are, I believe, the future, and I hope that we don't have to wait too long. Now there are approaches, in particular to apply CUT&Tag to single cells, and they've been shown to be reliable, and yeah, I mean, you're entirely right. This would be the ultimate approach. Of course, when you look at single-cell approaches, if you look at ATAC sequencing or CUT&Tag on single cells, then for each cell, in that specific chromosomal position, this becomes a digital problem, in the sense that either you have signal or you don't. Or, since you have your two alleles, you might have zero signal, signal one, or signal two, because you detect the binding on zero, one, or two alleles.

Reproducibility Challenges with Digital Events in Cell Identity

CC:
[55:55] And I think a challenge in the future will also be to address what reproducibility means there. When you are relying on digital events to call out cell identity, so the blue cell is different from the red cell because it presents this digital event, then you have a problem of reproducibility, because we want to make sure that something is true if that event occurs again. While here, conceptually, we are distinguishing two cells because of an event that is individual for that cell. So when the events are digital, I think it's going to be, of course, not impossible, but more difficult to distinguish a real biological idiosyncrasy defining cell identity.

SD:
[56:56] Yeah, I mean, then other factors might come into play, like cell cycle and other things, right? Why is the factor here? Why isn't it here, right?

CC:
[57:05] So, yeah. But of course, to the point that you raised, there is no better answer than saying you're right. We also look forward to the people who are working on this developing the technology that allows us to do a single-cell Iceberg, for example.

SD:
[57:26] So just to finish off this part, what would you say, how many biological replicates are enough? How much or how many should a researcher be doing?

CC:
[57:35] Yeah, yeah, so.

SD:
[57:38] I mean, nobody can do 25 as you did, right?

CC:
[57:41] Yeah, so, well, people could do 25. Of course, maybe not everyone could afford doing the experiments, but consider the sequencing costs and the now also reduced cost of library preparation for next-generation sequencing. In the lab, we are often confronted with the choice: should we do CUT&RUN qPCR, or should we just sequence? And then we decide to sequence, because the gain of doing CUT&RUN qPCR is not really substantial. I mean, if we do a full plate of SYBR Green-based qPCR, we pay more than the contribution of those few samples within a sequencing lane. So we would rather wait, accumulate enough samples, and then include them in a sequencing run, because their relative contribution is lower than the SYBR Green plate. And of course, then you have to add the costs of antibodies. I mean, of course, this is expensive stuff, but also if you want to do an immunohistochemistry or immunofluorescence staining on a tissue section, you need to buy an antibody, which costs $500.

SD:
[59:03] Also, the cost in terms of the amount of antibody you need for a CUT&RUN experiment is way lower than for ChIP-seq, for example.

CC:
[59:10] Yeah, I'm glad you point this out, precisely. So in the Iceberg paper we say, you know, we don't expect that everyone does 25 replicates. On the other hand, one shouldn't think of this as an impossible feat, because it's 25 replicates of CUT&RUN. I hope I'm not saying something that is awfully wrong, but I don't believe so: it's kind of less expensive than an individual single-cell sequencing experiment with the current technology that uses the appropriate controls and the appropriate number of replicates. These are great technologies, and they are expensive. The cost will drop in the future, of course, but nowadays it would cost my lab less to do an Iceberg of the next transcription factor than a single-cell sequencing experiment, which we do.

Addressing the Importance of Replicates in Discovering Cell Identity

CC:
[1:00:04] I don't want to... so we do, and we love the single-cell sequencing technologies. But to go back to your question, that's why we have figure four in the paper, because there is a world out there of people who are packaging their story with two or three replicates.
[1:00:26] And if I were now to review an article for a scientific journal, I wouldn't say do 25 replicates. But I would urge the scientists at least to try to address the point of what is their
[1:00:45] fraction of discovery. How can they estimate how much of their own iceberg they are detecting? I think it's very important. I mean, all of us want to discover genuinely good things. None of us claims to have discovered the entire functioning of the universe. I discovered that this transcription factor regulates gene X and maybe gene Y, and this has these consequences. And then we don't know all the reality. And that's okay. And the Iceberg could really be used as a tool to have an estimate of how certain I am about my discovery, and likely how much I am missing. And that's okay. I think it's entirely fine.
[1:01:30] And if I may add one last thing, an important piece of data is that when you target chromatin marks, which is also an important part of the efforts from the community, since there are several chromatin marks and they define how the genome is used, even one replicate does pretty well in identifying the iceberg. And we have seen that across 10 replicates, the concordance, the overlap of those replicates, is very high. And most of the peaks that we identified do occur in all of the 10 replicates. And this is spectacularly showing us that, depending on the factor, you really have a different outcome and expectation in terms of how many times you have to do the experiment to be satisfied with your discovery rate.

SD:
[1:02:23] So just a small question to add. When you say replicate, this is a true biological replicate, right? So you start with a separate biological entity at the beginning, right? So you're not talking about some splitting up in between and doing some kind of technical replicate, But it's a true biological replicate then.

The ongoing debate: Technical Replicate vs Biological Replicate

CC:
[1:02:47] I don't know, I don't know, but in the lab we have thought about this very often. This is a discussion that is always open, but never closed, because we don't know how to address this point. What is a technical replicate? What is a biological replicate? When, for example, you are doing a CUT&RUN experiment in a cell culture.

Addressing Technical Challenges in Cell Culturing

CC:
[1:03:18] And of course, there are reasonable opinions on that, but how we addressed the Iceberg problem was from a technical perspective. We wanted to be even more precise and not give the chance that other biological factors might influence this, factors which have to do with, say, the temperature outside: now it's summer, it's changing, and the cells inside our lab grow more. So what we did: we cultured a big amount of cells, actually we did this in two rounds, and then we split them into 10 or 15 samples, such that we have 10 or 15 of what we refer to as technical replicates. And then two weeks later we did another 15, and those are 15 technical replicates. One could think that the two experiments are two big biological replicates. On the other hand, Anna did look at this with classical approaches, for example principal component analysis, and you cannot really distinguish two groups.
[1:04:31] Yeah, this was very important. But yet, even when you do technical replicates, the experiments should be exactly the same, but they are not. And we don't know why. And if you look at figure two, what we showed there, with a good dose of pride, is all the 25 replicates. They all show that they work spectacularly well, because we have very high signal and very low background in the region where we know there should be signal. And they all look like each other. But if you zoom in, you see that there are differences in peak size also in this region. Why does this happen? That's the problem we try to address.

SD:
[1:05:19] So is this what you are working on, right, like currently and for maybe the next five years that you're trying to address the technical and biological variability in those experiments?

CC:
[1:05:31] Yes, in part, not only, not only. So we started off, I as a scientist and my lab, as a whole, we started off to address biological questions, not really technical questions. And I'm also kind of a little bit, I'm happy with hindsight, that we are also dealing with technology, and perhaps a little bit philosophical issues on how to deal with the signal detection theory in molecular biology and genomics.
[1:06:10] So I'm admittedly proud of the people in my lab who really can do that. But we started off with a genuine biological question, which I think is what we are going to focus on in the future. In particular, for example, we are going to use those tools to address the activity of beta-catenin and other factors, which for many years have been shown to operate in a somehow universal manner, conserved across species, model organisms, and cell types. But if you look in detail, when we apply those technologies tissue-specifically, we see that proteins like beta-catenin act in a very different way. And we are trying to understand first how, and then why. And this is what we are going to focus on a lot in the future. But another thing that I learned in my still short career is that what I think is going to happen is not going to happen. And some other things will happen, such as Iceberg. I never anticipated that we would focus our efforts on such an enterprise.

SD:
[1:07:23] So Claudio, this is now officially the longest podcast episode we did and we have like 106 now. So thank you very much for your time and for being on the show.

CC:
[1:07:33] I'm very grateful for your invitation and I apologize. I become very talkative, I realize when it's about.

SD:
[1:07:40] It's great, it was great information that you shared.

CC:
[1:07:43] But thank you very much. I'm very happy that we did this.

 

Active Motif CUT&RUN Kit