Three Ideas for Defusing Weapons of Math Destruction

I originally wrote this article for Cathy O'Neil's blog. I'm reposting it here for my blog readers. I was inspired to write it after I heard a talk (similar to this one) by Cathy about her book Weapons of Math Destruction. It's a book about the sometimes extremely negative effects that algorithms, in particular algorithms of social control, can have on people's lives. We ended up chatting after the talk and this blog article was born.

When your algorithms can potentially affect the way physicians treat their dying patients, it really brings home how critical it is to do data science right. I work at a biomedical research institute where I study cancer data and ways of using it to help make better decisions. It’s a tremendously rewarding experience. I get the chance to apply data science on a massive scale and in a socially relevant way. I am passionate about the ways in which we can use automated decision processes for social good and I spend the vast majority of my time thinking about data science.

A year ago, I started working on a framework for assessing the performance of an algorithm used heavily in cancer research. The first part of the project involved gathering all the data we could get our hands on. The datasets had been created by different processes and had various advantages and disadvantages. The most valued, but also the most labor-intensive to create, were datasets that had been manually curated by multiple people. More plentiful were datasets that had not been manually curated but had been assessed by so many different algorithms that they were considered extremely well characterized. Finally, there were artificial datasets created by simulation, for which the truth was known, but which lacked the complexity and depth of real data. Each type of dataset required careful consideration of the kind of evidence it provided about algorithm performance. I came to really understand that validating an algorithm and characterizing its typical errors are an essential part of the data science itself. The project taught me a few lessons that I think are generally applicable.

Use open datasets

In most cases, it is preferable that algorithms be open-source and available for all to examine. If algorithms must be closed-source and proprietary, then open, curated datasets are essential for comparisons among algorithms. These may include real data that has been cleared for general use, anonymized data, or high-quality artificial data. Open datasets allow us to analyze algorithms even when they are too complex to understand or when their source code is hidden. We can observe where and when they make errors and discern patterns in those errors. We can determine in which circumstances other algorithms perform better. This insight can be extremely powerful when it comes to applying algorithms in the real world.
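To make that concrete, here is a minimal sketch of what evaluating a closed-source algorithm against an open, curated dataset might look like. This is my own illustration, not the pipeline from the project described above: black_box_predict stands in for a proprietary model we can query but not inspect, and the labels and dataset types are synthetic.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A hypothetical open, expert-labeled dataset: binary truth labels plus a tag
# for which kind of dataset each example came from.
curated = pd.DataFrame({
    "truth": rng.integers(0, 2, size=1000),
    "dataset_type": rng.choice(
        ["manually curated", "well-characterized", "simulated"], size=1000
    ),
})

def black_box_predict(df):
    # Stand-in for a closed-source algorithm: a noisy copy of the truth.
    flip = rng.random(len(df)) < 0.15
    return np.where(flip, 1 - df["truth"], df["truth"])

curated["prediction"] = black_box_predict(curated)
curated["error"] = curated["prediction"] != curated["truth"]

# Error rates broken down by dataset type show *where* the algorithm fails,
# even though we never saw its internals.
print(curated.groupby("dataset_type")["error"].mean())

The point is that characterizing errors only requires the algorithm's outputs and a trusted reference, which is exactly what open datasets provide.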

Take manual curation seriously

Domain-specific experts, such as doctors in medicine or coaches in sports, are generally a very powerful source of information. Panels of experts are even better. While humans are by no means perfect, when careful consideration of an algorithmic result by experts implies that the algorithm has failed, it's important to take that message seriously and to investigate whether and why the algorithm failed. Even if the problem is never fixed, it is important to understand the types of errors the algorithm makes and to measure its failure rate in various circumstances.

Demand causal models

While it has become very easy to build systems which generate high-performing black-box algorithms, we must push for explainable results wherever possible. Furthermore, we should demand truly causal models rather than merely predictive ones. Predictive models perform well when there are no external modifications of the system. Causal models continue to be accurate despite exogenous shocks and policy interventions. Frequently, we create the former yet try to deploy them as if they were the latter, with disastrous consequences.
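As a toy illustration of that last point (my own sketch, not from the original article), the snippet below fits one model on a proxy variable that merely correlates with the outcome and another on the true causal driver. After a simulated "policy intervention" that changes how the proxy is generated, only the causal model keeps predicting well. The simulation and all variable names are assumptions made purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 5000

def simulate(intervened):
    cause = rng.normal(size=n)                 # the true causal driver
    outcome = 2.0 * cause + rng.normal(size=n)
    if intervened:
        # Policy change: the proxy no longer tracks the cause at all.
        proxy = rng.normal(size=n)
    else:
        # Before the intervention, the proxy is a near-copy of the cause.
        proxy = cause + 0.1 * rng.normal(size=n)
    return cause.reshape(-1, 1), proxy.reshape(-1, 1), outcome

cause, proxy, y = simulate(intervened=False)
predictive = LinearRegression().fit(proxy, y)   # learns the correlation
causal = LinearRegression().fit(cause, y)       # learns the mechanism

# After the intervention, only the model built on the causal variable holds up.
cause2, proxy2, y2 = simulate(intervened=True)
print("predictive R^2 after intervention:", predictive.score(proxy2, y2))
print("causal R^2 after intervention:", causal.score(cause2, y2))

In practice the causal structure is rarely this obvious, which is exactly why the distinction deserves so much attention before a model is deployed.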

All three principles share one underlying idea. Bad data science obscures and ignores the real-world performance of its algorithms. It relies on little to no validation, and when it does validate, it falls back on canned approaches. It doesn't critically examine instances of bad performance with an eye toward understanding how and why those failures occur. It doesn't make the nature of these failures widely known so that consumers of these algorithms can deploy them with discernment and sophistication.

Good data science does the opposite. It creates algorithms which are deeply and widely understood. It allows us to understand when algorithms fail and how to adapt to those failures. It allows us to intelligently interpret the results we receive. It leads to better decision making.

Let’s stop the proliferation of weapons of math destruction with better data science!

The Summer of Learning to Learn

Written in response to a blog post by my thoughtful and multi-talented friend Kaitlyn Choi.

At the 86th Oscars, in his acceptance speech for best actor, Matthew McConaughey shared that his hero was the version of himself that he would be in ten years. The speech was very widely mocked. Amid the literally self-congratulatory tone of the Oscars, it was perhaps an inartful way of expressing a relentless drive to be better. The mockery generally overshadowed his point: that rather than focusing on the mastery we see in others, we should try to be the best version of ourselves.

On the eve of my return to life as a student, I've been reading a lot of books on learning. My summer reading list includes Mastery by Robert Greene; Peak by Anders Ericsson, one of the foremost experts on expertise; Grit by Angela Duckworth; Deep Work by Cal Newport; and The First 20 Hours, a book about rapid skill acquisition by Josh Kaufman. I also read How to Talk Like TED, a book about expert-level public speaking. Mastery explores the ethos of becoming good at what we do. Newport's book delves into the mechanics of creating a routine conducive to focus. The First 20 Hours is a dilettante's dream, an enthusiastic leap headfirst into the art of rapid skill acquisition. Duckworth discusses the individuals who keep going when the rest of us give up, a trait she calls grit, and how to become one of them. Ericsson, whose research Gladwell popularized as the ten-thousand-hour rule, brings us up to date on the latest academic research in the now classic field of expertise.

I'm glad everybody decided to write all these interesting books just as I was about to go to grad school! I will take all the help I can get! I'm still working my way through these books, so I'll probably write more about them as I learn more. For now, I'll just leave you with two thoughts:

1. I attended Angela Duckworth's book tour event at the Harvard Book Store. Her recipe for success was passion, practice, a higher purpose, and hope. The first three ingredients are pretty self-explanatory. She explained that hope was the belief that things would turn out alright, that our efforts would be rewarded with success. I thought this was interesting because it dovetails nicely with an article I read on loss of hope being a contributor to burnout.

2. Discussions of peak performance and rapid skill acquisition often concentrate on the nature of child prodigies. Summarizing prior work with prodigies and their parents, Scott Barry Kaufman wrote that there were significant environmental factors affecting the extreme level of skill development found in child prodigies: the existence of a domain that matched their talents; proximity to opportunities for learning; cultural support; recognition for achievement; access to training resources; material support from family members, often including at least one parent completely committed to the prodigy's development; family traditions that favor the prodigy's development; and favorable historical forces, events, and trends. I can see parallels with what I'm reading in the expertise and skill acquisition literature. The First 20 Hours, for instance, talks a lot about lowering the barriers to learning, gathering the necessary materials, and blocking out time to focus exclusively on learning.

A Data Scientist Goes Back to Graduate School

After spending the last few years as a data scientist, I have come to the conclusion that the most vexing problems in data science are not exclusively computational in nature. In fact, computation is typically the easy part. Many of the worst problems that persist and fester in the field are deeply statistical questions. How do we automate the creation of truly causal models? How do we build models that incorporate expert knowledge in a deep way but still have all the benefits of statistical models?

I realized that I wanted to spend the next few years thinking deeply about statistics, especially as it relates to data science, biology, and medicine, and the Harvard Biostatistics department seemed like one of the best places in the world to begin that journey. So, I'm off to graduate school in the fall. Wish me luck! I may very well need it! :)

P.S. I've said this before and it may not pan out this time either, but I'm going to try to write more on the blog. Once I'm a student again, I should have more time to ponder and explore, to analyze ideas, and to synthesize my own. That should translate directly into more blog articles!

A Farewell to C. elegans

C. elegans is a nematode worm, small enough that you can barely see an adult unless it's placed right up against your eye. C. elegans is also where I got my start in analyzing extremely large biological datasets. It turns out that the little worm is quite the big data organism. It doesn't eat very much, it reproduces quickly, and it doesn't live very long. This makes it easy to collect lots of information about the life cycles of large numbers of individual worms. I rarely thought about fewer than a hundred worms at a time: the average worm, so to speak.

My love for numbers tended to seep into my approach to C. elegans. I can say from memory that the adult hermaphrodite has 959 cells and the male 1031. Of these cells, 302 are neurons in the hermaphrodite and 381 in the male. The genome has 6 chromosomes containing 100 million base pairs (about one-thirtieth the size of the human genome).

This love of numbers came in handy, because I primarily worked with C. elegans genomic data. My numbers came in sets of 100 million at a time, and so, understandably, I often had my head down in the weeds, tossing around huge chunks of information. It was absorbing enough that I rarely thought about the worms in concrete terms, only as a means of producing numbers. If I couldn't detect a phenomenon in the numbers, then for me it might as well never have existed.

One day, a coworker asked me if I'd ever seen naked worm DNA. I hadn't. She held up a vial of wispy, white material suspended in a clear liquid. I asked how many worms were needed to make it. Tens of thousands! For some reason, my thoughts shifted to a cheesy movie with a wizard or mad scientist extracting the essence of thousands of souls, holding up a glowing vial to the light, and cackling with self-satisfaction. We were in the business of turning this goop into insight. That tiny amount of material contained enough information for months, if not years, of analysis.

I found that the wispiness of the physical DNA had its counterpart in the numbers. Often the conclusions produced by crunching large amounts of data are elusive.  It turns out that averaging millions of numbers, over thousands of individuals, in thousands of states of being, can lead one into a morass of truths that can only be expressed through probability.  I had the feeling of sitting in my quantum mechanics class, where everything was described in terms of interacting probability distributions.  It was hard to nail anything down!

When one decides to supersize one's data, all the processes of data aggregation and analysis become directly linked to the insight that can be gained about the real world. The price one pays for letting one's data grow beyond the capacity of the human mind to absorb is that the study of the data, with all its idiosyncrasies, becomes the science. It's necessary to do a lot of filtering of artifacts, to identify and tame the quirks and biases of collection and analysis. This diffuseness is the soul of big data. The data scientist hopes that by plunging deeper and deeper into its patterns, one can fish out a few pearls of certainty.

Part of data science is getting to know new domains of data intimately, but also moving on when a new dataset calls. So, I'm done with C. elegans, but my experiences with that data have become integrated into my understanding of how data behaves in real life. I still work on sequencing data, but now I work with human beings, patients, and cancer. It's a whole new world.

Pragmatism in Big Data

“Pragmatism is a rejection of the idea that the function of thought is to describe, represent, or mirror reality. Instead, pragmatists develop their philosophy around the idea that the function of thought is as an instrument or tool for prediction, action, and problem solving. Pragmatists contend that most philosophical topics—such as the nature of knowledge, language, concepts, meaning, belief, and science—are all best viewed in terms of their practical uses and successes rather than in terms of representative accuracy.”

I am hoping to write something about the relationship between big data and pragmatism, but I want to get through Pragmatism by William James first, which admittedly is not a very thick book!

I’ll give a sketch of my thinking. I would argue that it’s generally too optimistic to believe that explorations into big data will get one to the ground truth.  The truth is most easily accessed via natural experiments and careful investigation by humans.  This is not an approach that scales well to billions or trillions of observations, where we are forced to use statistics and artificial intelligence as a substitute for human intuition.  However, if one can build models that continue to approximate the truth even as the assumptions of the model are partially violated, such models are an acceptable outcome of any big data investigation. In other words, pragmatism is the most realistic philosophical perspective for the practicing big data scientist.

Moving to Boston; New position at Harvard

What I’m doing these days.

I recently accepted a position as a Fellow at Harvard University. So, I am in Massachusetts these days. I spend most of my time on the Harvard main campus, although sometimes I'm at the medical campus in Longwood. It's all very new, so it's much, much too early to claim there is any pattern yet, or that whatever patterns there are will persist. It is nice, though, to be able to move around between the two campuses.
