I recently went to a conference where there was a panel discussion titled ‘Data Driven Research/Integrative Biology’. The discussion was what you would expect it to be: a group of bioinformaticians fielding very vague questions about cultural truisms of the data science/informatics space. In truth, the thing that really irked me about the panel wasn’t the content, but the title. It may seem silly to have to say this, but science is inherently data driven. Our decisions are based on, you guessed it, data! It guides our intuition and informs our paths of inquiry. Saying we need to make science more ‘data driven’ is like saying we need to make oceans ‘wetter’.

There’s been seemingly endless talk and hype in the last few years about ‘data’, particularly in the internet and technology world. Terms like ‘Big Data’ and ‘Data Driven’ show up in almost every article and blog post. But in the world of web development and software engineering, making decisions based on data is a new(ish) thing. The data revolution we’re living through right now is nothing more than the application of the scientific method to software design.

But it’s easy to criticize and hard to create, and I don’t intend to digress much more. Science is, after all, an amazingly powerful discovery platform that uses data to drive everything it does. A more appropriate title for the talk would have been: “Modern Data Science tools for Biologists and Clinicians and how to leverage them to aid in research”. We don’t need a reminder to ‘let the data drive your product’; we already do that (or at least we should). We need to get better at working with that data, particularly at the GB and TB scale. Learning to work with the data we collect the way technology companies do is paramount to pushing biotech and basic life science research into the 21st century.

Despite my ire at the semantics of the panel’s buzzworthy title, I think our goals are approximately aligned: to educate biologists and other ‘non-computer’ people about the importance of using better tools in their work. With that in mind, here is what some of the data workflows look like for Perlstein Lab’s high-throughput screening efforts.

Problem 1: My Data is Messy

The first major hurdle in the day of an informatician is dealing with the outputs of various machines and software packages. Data wrangling, as it’s also called, is easily 75% of the work. When people talk about the amazing, interesting things data scientists do, they often neglect to mention the dramatic percentage of time spent cleaning, organizing, and reshaping data before you can even begin any kind of analysis. At Perlstein Lab, we primarily use the Python package Pandas (part of the SciPy stack) for most of this work. At this point, I suspect some programming-inclined readers are raising their eyebrows and announcing to their computer screens, “Why not R!” or “Why not MATLAB!”. To which I reply: 1) I can’t hear you through the internet, and 2) I’m not going to get into the data-science-favorite-tools debate apart from one comment. R and MATLAB are great, but Python, by and large, has a much bigger community. When you accidentally break something or get stuck, there are volumes of information on Stack Overflow and other internet resources to help you.
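As a concrete, if hypothetical, example, here is roughly what that wrangling looks like in Pandas. The file name and column layout below are invented for illustration; real instrument exports are usually messier still.

```python
import pandas as pd

# Hypothetical plate-reader export: a couple of header rows to skip,
# inconsistent column names, and blank padding rows at the bottom.
df = pd.read_csv("plate_reader_export.csv", skiprows=2)

# Normalize the column names: strip whitespace, lowercase, use underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Drop the empty rows the instrument pads onto the end of the file.
df = df.dropna(how="all")

# Reshape from one-column-per-well ("a1", "a2", ...) to a long, tidy table
# that's easy to group, filter, and plot.
tidy = pd.melt(df, id_vars=["plate_id", "timepoint"],
               var_name="well", value_name="signal")

print(tidy.head())
```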

Problem 2: My Graphs are Ugly

In the mid-’90s version of a biotech start-up, you would have completed Problem 1 by futzing around in Excel for a few hours and then used the built-in charting functions to make a graph with bad labels and a crummy color palette. That’s fine if it’s all you need, but more often than not these graphs aren’t dynamic, they take too long to make and customize, and they’re a pain to export. For graph-making that sits on top of our programming-oriented data workflows, I primarily use matplotlib or seaborn (there’s a short seaborn example at the end of this section), but for most of our non-coding scientists those tools are a bit inaccessible. Luckily, there is a slew of great data visualization and analytics tools out there; our scientists primarily use Tableau for this purpose. Here is a graph I generated in negligible time using Tableau:

[Image: chart generated in Tableau]

It’s clean, has a good color palette, communicates the information clearly, and exports easily in higher-resolution formats.

This is the same data graphed using Excel:

[Image: the same data charted in Excel]

The default colors are crummy, the legend-to-graph ratio is out of proportion, and this chart took approximately twice as long to create.

Tableau also lets you build interesting, aesthetically pleasing interactive dashboards to track your various metrics:

[Image: interactive Tableau dashboard]
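For the coding-inclined, matplotlib and seaborn will get you similar polish in a handful of lines. Here is a rough sketch; the data file and column names are made up for illustration:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical screening results; the file and columns are placeholders.
results = pd.read_csv("screen_results.csv")

sns.set_style("whitegrid")  # clean background and sensible defaults
ax = sns.barplot(data=results, x="compound", y="viability", hue="cell_line")
ax.set_xlabel("Compound")
ax.set_ylabel("Relative viability")
plt.tight_layout()
plt.savefig("viability_by_compound.png", dpi=300)  # high-resolution export
```

Two of the Excel pain points above, default aesthetics and high-resolution export, come down to a single set_style call and a dpi argument.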

Problem 3: My Data is Too Big

The last tool I’m going to talk about may seem a little high level for some readers, but bear with me because I promise that I’ve saved the coolest for last.

Say it’s the year 2000 and you’re Jim Kent trying to assemble the human genome. What might you do to accomplish it? The thing about assembling a genome is that it’s fundamentally a big data problem. The potentially apocryphal story I was told as an undergrad at UCSC was that David Haussler locked Jim Kent in a garage for 72 hours while he ran the GigAssembler program. Nowadays, there is a litany of fantastic tools for processing large quantities of data, limiting the need for grad advisors to imprison their students.

The main technical challenge with ever-increasing data sizes is that you eventually hit limits on your ability to process them on a single machine. The last few years have seen an explosion in the number of tools designed to distribute data across a group of computers and then query it as needed. The most common tool in the landscape is Hadoop, and most big data tools are built on top of the Hadoop framework, inheriting many of its flaws by extension. There is, however, one newer tool that I am particularly excited about. Spark is a general-purpose big data platform that came out of a project at UC Berkeley’s AMPLab and became an Apache Software Foundation project in 2013.

Traditionally, once you had your data distributed across some number of computers, you would write a MapReduce job to perform an analysis. MapReduce jobs are nightmares to write; in truth, I’ve only written a handful of them because it’s such a pain. Spark, by contrast, brings an elegant API for interactively querying your data in a real-time REPL.

This is an example of some simple MapReduce code to count all the words in a text file:

[Image: Hadoop MapReduce word-count code]
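To give a flavor of what that involves, here is a sketch of a comparable word count written as Hadoop Streaming mapper and reducer scripts in Python. The file names are illustrative, you still need a multi-flag hadoop jar invocation to actually submit the job, and the classic Java version is more verbose still.

```python
#!/usr/bin/env python
# mapper.py -- read lines from stdin, emit one tab-separated (word, 1) pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop sorts the mapper output by key, so all the counts for a
# given word arrive contiguously; accumulate until the key changes, then emit.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```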

This code is a nightmare to read, and if I had written it I would probably say it was a nightmare to write.

Here’s the same program written for Spark:

[Image: the same word count written for Spark]
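In PySpark, for instance, the whole job comes down to a few lines (the input and output paths below are placeholders):

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

# Read the file, split each line into words, pair each word with a 1,
# and sum the counts per word -- nothing executes until saveAsTextFile runs.
counts = (sc.textFile("book.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("wordcounts")
```

The same chain of calls can be typed straight into the pyspark shell, where a SparkContext named sc is already waiting for you; that is the interactive REPL mentioned above.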

“Aha!” you’re probably thinking, “if it’s easier to write, it’s probably a lot slower to run!” That assumption is incorrect: Spark generally outperforms traditional MapReduce, in large part because it can keep intermediate results in memory instead of writing them back out to disk between steps.

But why should you care about this semi-obscure big data technology? Well, in many ways the other tools I’ve talked about could accomplish the same jobs at a smaller scale, but therein lies the rub: they are limited in scale. To really make advances in science and biotechnology, we need to make a concerted effort to use the full weight of our data sets. The scale of data we are able to acquire in clinical and laboratory settings is increasing dramatically. A single human genome is ~3.3×10^9 base pairs, and the raw sequencing reads behind it can take up roughly 350 GB of space before assembly. If we’re going to stay ahead of that growth, there needs to be mass adoption of tools that make it possible.
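As a rough back-of-envelope check on that figure (the coverage depth and bytes-per-base numbers here are assumptions for illustration, not measurements):

```python
# Back-of-envelope estimate of raw read volume for one human genome.
genome_size_bp = 3.3e9   # haploid human genome, in base pairs
coverage = 50            # assume ~50x sequencing coverage
bytes_per_base = 2       # FASTQ stores a base call plus a quality score

raw_bytes = genome_size_bp * coverage * bytes_per_base
print("~%.0f GB of raw reads" % (raw_bytes / 1e9))  # ~330 GB
```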

For the last 15 years, biologists (and other types of scientists, for that matter) have been able to get by with very 20th-century technologies. These tools are becoming glaringly antiquated (I’m looking at you, Microsoft Excel users). Getting biologists to think about modern technologies is critical to modernizing and scaling the life sciences. If we don’t, we run the risk of delaying scientific progress, slowing the search for new cures, and postponing new discoveries.

 

Image: (c) Can Stock Photo / sergunt
