C. elegans is a nematode worm, small enough that you can barely see an adult unless it’s placed right up against your eye. C. elegans is also where I got my start in analyzing extremely large biological datasets. It turns out that the little worm is quite the big data organism. It doesn’t eat very much, it reproduces quickly, and it doesn’t live for very long. This makes it easy to collect lots of information about the life cycles of large numbers of individual worms. I rarely thought about fewer than a hundred worms at a time; I thought about the average worm, so to speak.
My love for numbers tended to seep into my approach to C. elegans. I can say from memory that the adult hermaphrodite has 959 cells and the male 1031 cells. Of these cells, 302 are neurons in the hermaphrodite and 381 in the male. The genome has 6 chromosomes containing 100 million base pairs (about one thirtieth of the size of the human genome).
This love of numbers came in handy, because I primarily worked with C. elegans genomic data. My numbers came in sets of 100 million at a time, and so, understandably, I often had my head down in the weeds, tossing around huge chunks of information. It was absorbing enough that I rarely thought about the worms in concrete terms, only as a means of producing numbers. If I couldn’t detect a phenomenon in the numbers, for me, it might as well never have existed.
One day, a coworker asked me if I’d ever seen naked worm DNA. I hadn’t. She held up a vial of wispy, white material suspended in a clear liquid. I asked how many worms were needed to make it. Tens of thousands! For some reason, my thoughts shifted to a cheesy movie with a wizard or mad scientist extracting the essence of thousands of souls, holding up a glowing vial to the light, and cackling with self-satisfaction. We were in the business of turning this goop into insight. That tiny amount of material contained enough information for months, if not years, of analysis.
I found that the wispiness of the physical DNA had its counterpart in the numbers. Often the conclusions produced by crunching large amounts of data are elusive. It turns out that averaging millions of numbers, over thousands of individuals, in thousands of states of being, can lead one into a morass of truths that can only be expressed through probability. I had the feeling of sitting in my quantum mechanics class, where everything was described in terms of interacting probability distributions. It was hard to nail anything down!
When one decides to supersize one’s data, all the processes of data aggregation and analysis become directly linked to the insight that can be gained about the real world. The price one pays for letting one’s data grow beyond the capacity of the human mind to absorb is that the study of the data, with all its idiosyncrasies, becomes the science. It’s necessary to do a lot of filtering of artifacts, to identify and tame the quirks and biases of collection and analysis. This diffuseness is the soul of big data. The data scientist hopes that by plunging deeper and deeper into the patterns of big data, one can fish out a few pearls of certainty.
Part of data science is getting to know new domains of data intimately, and then moving on when a new dataset calls. So, I’m done with C. elegans, but the experiences with that data have become integrated into my understanding of how data behaves in real life. I still work on sequencing data, but now I work with human beings, patients and cancer. It’s a whole new world.