A farewell to C. elegans

C. elegans is a nematode worm, small enough that you can barely see an adult unless it’s placed right up against your eye. C. elegans is also where I got my start in analyzing extremely large biological datasets. It turns out that the little worm is quite the big data organism. It doesn’t eat very much, it reproduces quickly, and it doesn’t live for very long. This makes it easy to collect lots of information about the life cycles of large numbers of individual worms. I rarely thought about fewer than a hundred worms at a time; I thought about the average worm, so to speak.

My love for numbers tended to seep into my approach to C. elegans. I can say from memory that the adult hermaphrodite has 959 cells and the male 1031 cells. Of these cells, 302 are neurons in the hermaphrodite and 381 in the male. The genome has 6 chromosomes containing 100 million base pairs (about one thirtieth the size of the human genome).

This love of numbers came in handy, because I primarily worked with C. elegans genomic data. My numbers came in sets of 100 million at a time, and so, understandably, I often had my head down in the weeds, tossing around huge chunks of information. It was absorbing enough that I rarely thought about the worms in concrete terms, only as a means of producing numbers. If I couldn’t detect a phenomenon in the numbers, for me, it might as well never have existed.

One day, a coworker asked me if I’d ever seen naked worm DNA. I hadn’t. She held up a vial of wispy, white material suspended in a clear liquid. I asked how many worms were needed to make it. Tens of thousands! For some reason, my thoughts shifted to a cheesy movie with a wizard or mad scientist extracting the essence of thousands of souls, holding up a glowing vial to the light, and cackling with self-satisfaction. We were in the business of turning this goop into insight. That tiny amount of material contained enough information for months, if not years, of analysis.

I found that the wispiness of the physical DNA had its counterpart in the numbers. Often the conclusions produced by crunching large amounts of data are elusive.  It turns out that averaging millions of numbers, over thousands of individuals, in thousands of states of being, can lead one into a morass of truths that can only be expressed through probability.  I had the feeling of sitting in my quantum mechanics class, where everything was described in terms of interacting probability distributions.  It was hard to nail anything down!

When one decides to supersize one’s data, all the processes of data aggregation and analysis become directly linked to the insight that can be gained about the real world. The price one pays for letting one’s data grow beyond the capacity of the human mind to absorb is that the study of the data, with all its idiosyncrasies, becomes the science. It’s necessary to do a lot of filtering of artifacts, to identify and tame the quirks and biases of collection and analysis. This diffuseness is the soul of big data. The data scientist hopes that by plunging deeper and deeper into the patterns of big data, one can fish out a few pearls of certainty.

Part of data science is getting to know new domains of data intimately, but moving on when a new dataset calls. So, I’m done with C. elegans, but the experiences with that data have become integrated into my understanding of how data behaves in real life. I still work on sequencing data, but now I work with human beings, patients and cancer. It’s a whole new world.

Pragmatism in Big Data

“Pragmatism is a rejection of the idea that the function of thought is to describe, represent, or mirror reality. Instead, pragmatists develop their philosophy around the idea that the function of thought is as an instrument or tool for prediction, action, and problem solving. Pragmatists contend that most philosophical topics—such as the nature of knowledge, language, concepts, meaning, belief, and science—are all best viewed in terms of their practical uses and successes rather than in terms of representative accuracy.”

I am hoping to write something about the relationship between big data and pragmatism, but I want to get through Pragmatism by William James first, which admittedly is not a very thick book!

I’ll give a sketch of my thinking. I would argue that it’s generally too optimistic to believe that explorations into big data will get one to the ground truth.  The truth is most easily accessed via natural experiments and careful investigation by humans.  This is not an approach that scales well to billions or trillions of observations, where we are forced to use statistics and artificial intelligence as a substitute for human intuition.  However, if one can build models that continue to approximate the truth even as the assumptions of the model are partially violated, such models are an acceptable outcome of any big data investigation. In other words, pragmatism is the most realistic philosophical perspective for the practicing big data scientist.


Snapple Fact

Moving to Boston; New position at Harvard

What I’m doing these days.

I recently accepted a position as a Fellow at Harvard University, so I am in Massachusetts these days. I spend most of my time on the main Harvard campus, although sometimes I’m at the medical campus in Longwood. It’s all very new, so it’s much too early to claim there is any pattern yet, or that whatever patterns there are will persist. Still, it’s nice to be able to move between both campuses.

Don’t be too Risk Averse

This is a repost of my most recent article on the American Mathematical Society’s Graduate Student Blog.

Graduate students are notoriously frugal. Although this is sometimes an economic necessity, in some ways I think we are a little too frugal. One of the ways that this frugality manifests, and the one I find most prevalent, is risk aversion. This can be hard to detect because we are an adventuresome bunch. Although perhaps more buttoned-down than the general college lot, we will try new foods, learn new languages, and travel to new spots. However, in things that matter to our subject of interest, we tend to show our conservative side. I believe we ought to be more interested in doing unconventional things like entering contests, starting blogs, writing articles, making friends in other departments, or collaborating with friends in other departments.

Mathematics Visualization

First, I should make long-overdue mention of a visualization done by Rachel Binx. It uses some of the data I had on the tagging of papers to figure out the relationships between different mathematics fields. I’ve never met her, but she is apparently “a feisty young woman operating out of the bay area.” She was kind enough to let me know about it last year. So now, I’m letting you guys know about it (all three of you). The visualization is interactive and you can check it out by clicking here.

It can be hard to keep up with a blog, especially when you have a lot of other things going on … for instance, being the Editor-in-chief of a second blog. So, I will try not to neglect this one. I will also, probably, do something cheap on occasion like link to posts on the other blog.

New Data Set

Paul Ginsparg recently provided me with a truly awesome dataset concerning the mathematics papers on arXiv.org. I am thinking of ways to put it to use in making new mathematics illustrations and hopefully improving on my previous project.