Take it to the Crowd
When you have a question about your health or your finances, you go to a doctor or an accountant for advice; you figure they have the knowledge you need to get the answers you’re looking for. But what about when you’re wondering where to go for dinner in a new city? Rather than hiring an expert chef to individually rate each restaurant—a pricey and time-consuming endeavor—you’d probably find it far more practical and efficient to trust the recommendations of the thousands of local diners who’ve already voluntarily rated the restaurants online. Today, crowdsourcing—in which many individuals each contribute a small piece of work toward a collective goal, such as sifting through a large amount of information—has indeed made it easier to choose a good restaurant or pick a movie you’ll likely enjoy. But the concept has also found an application in areas of research where numerous scientists have collected far more data than they could ever analyze on their own.
By taking this data to the crowd, researchers at Caltech have found a way to engage the public while also allowing so-called citizen scientists to investigate a variety of research topics—from very tiny cells on Earth to massive star clusters in our galaxy.
The Solar Army
Chemist Harry Gray and his colleagues at the National Science Foundation’s Center for Chemical Innovation in Solar Fuels (CCI Solar) are looking to answer one important question as quickly as possible: How can we tap the sun’s energy to power the planet? Gray and the CCI Solar group believe that the answer will involve using solar-powered devices to produce fuels. These environmentally friendly systems would use energy from the sun to split water molecules on sunny days, generating storable hydrogen as fuel that could be used later to produce electricity.
Although the technology has the potential to satisfy all of humanity’s energy needs, it’s been difficult to find an abundant and cost-effective source for the catalysts needed to drive the essential water-splitting reaction. Platinum catalysts work well, but they are rare and expensive. By mixing together different combinations of metals, scientists hope to find an alternative to platinum that is just as effective but is also cheap enough for worldwide use. But testing the many combinations of metals on the periodic table is a task that Gray admits he and his colleagues need some help to complete. So, in 2009, Gray proposed that students around the world—whom he calls his Solar Army—could add to his group’s research efforts using an inexpensive apparatus to perform experiments similar to those that he and other chemists were doing in the lab.
Initially, the idea depended upon the Solar Hydrogen Activity Research Kit (SHArK), an inexpensive tool developed by CCI Solar colleague Bruce Parkinson for research in his laboratory at the University of Wyoming and distributed to science undergraduates by Gray. Using a modified inkjet printer, the students would first create tiny dots of metal oxide combinations, then target each dot with a light source—a commercial laser pointer. A current detector could then determine each mixture’s catalytic potential; if the light spurred a large increase in electrical current, the mixture might be a good candidate catalyst. The program then expanded into an after-school activity at local high schools in Southern California, with CCI Solar graduate students and postdocs serving as mentors. And after several years of experimentation with SHArK, the CCI Solar team at Caltech created a more streamlined version—called the Solar Energy Activity Laboratory, or SEAL—which could test compounds faster and allowed students to learn hands-on laboratory skills, such as pipetting and solution preparation.
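To make the screening logic concrete, here is a minimal sketch, in Python, of the kind of bookkeeping SHArK and SEAL automate: rank each metal-oxide spot by how much its current rises under illumination. All mixture names, measurements, and the threshold below are invented for illustration; the real kits handle this in hardware and their own software.

```python
# Hypothetical sketch of the SHArK/SEAL screening logic: flag metal-oxide
# spots whose electrical current jumps when the laser pointer hits them.
# Every value here is invented for illustration.

# (mixture label, dark current in microamps, illuminated current in microamps)
measurements = [
    ("Fe-Ni oxide", 0.02, 0.85),
    ("Cu-Zn oxide", 0.03, 0.05),
    ("Co-Mn oxide", 0.01, 0.40),
]

THRESHOLD = 10.0  # illustrative cutoff: keep spots with >10x current gain

candidates = []
for label, dark, lit in measurements:
    gain = lit / dark  # photocurrent gain when the light hits the spot
    if gain > THRESHOLD:
        candidates.append((label, gain))

# Print the most promising candidate catalysts first.
for label, gain in sorted(candidates, key=lambda c: -c[1]):
    print(f"{label}: {gain:.0f}x current increase under illumination")
```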
Now, after more than five years, Gray’s Solar Army has expanded to include students from more than 90 schools worldwide—meaning that more than 500 “recruits” are contributing to the search for these game-changing catalysts. “The Solar Army is something special,” Gray says. “A lot of crowdsourcing and citizen science projects are so focused on the educational component that very little real data are being produced. But using SHArK and SEAL, our students have discovered some interesting new materials that are being tested at Caltech and other universities.”
The Plant Pack
Gray isn’t the only Caltech researcher who has entrusted work to budding young scientists. A few years ago, former Caltech postdoctoral scholar Adrienne Roeder was working on a time-consuming project in developmental biology that relied on outlining, counting, and measuring the cells within plant tissues. When Alexandre Cunha, a data scientist at Caltech’s Center for Data-Driven Discovery, learned that Roeder had been manually tracing the outlines of cells in samples of plant tissue to complete the project, he knew he wanted to find a way to help.
Cunha first devised a computer program to semiautomatically generate the cell outlines. The program did most of the outlining work, but a final human touch was necessary to perfect the results. Since crafting a fully automatic solution would be extremely difficult, he came up with a compromise: combining his computer program with a crowdsourcing approach, enlisting several local classrooms as his crowd.
Roeder was working with cells from the sepals of the flowering plant Arabidopsis thaliana—a model organism for geneticists and developmental biologists. Sepals, or the leaflike structures that protect delicate flower petals during development, are made up of small cells that divide normally and big cells that continue to replicate their DNA without ever dividing. In her research, performed in the lab of Elliot Meyerowitz, Roeder aimed to understand which genes were involved in the development of each cell type and what specific factors determined a cell’s fate. This meant that she had to measure and count the cells from many normal and genetically mutated plants to determine which mutations affected the sizes and quantities of these cells.
“If you have a tissue with lots of cells and you want to measure the size of each cell, you mark the cell wall with a fluorescent dye, take a picture of the tissue with a laser scanning confocal microscope, and then you outline the boundaries of each cell,” says Cunha, who, as a data scientist, helps researchers across campus analyze their data in new ways. Although computer programs have been developed to identify, measure, and count cells in a tissue, they often have difficulty spotting the boundaries of cells with absolute certainty. So the time-consuming outlining process was still at least partially done by hand.
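For readers curious what the automated part of that pipeline might look like, here is a minimal sketch using the open-source scikit-image library: a standard watershed segmentation that finds cell boundaries in a wall-stained image and measures each cell’s area. The filename is hypothetical, and this generic approach is a stand-in rather than Cunha’s actual program.

```python
import numpy as np
from skimage import filters, measure, segmentation
from skimage.io import imread

# Load a 2D image in which cell walls are bright (fluorescently dyed).
image = imread("sepal_confocal.png", as_gray=True)  # hypothetical filename

# Smooth, then seed the watershed with one marker per dark cell interior.
smooth = filters.gaussian(image, sigma=2)
interiors = smooth < filters.threshold_otsu(smooth)  # dark regions = interiors
markers = measure.label(interiors)                   # one integer label per cell

# Flood outward from each marker; where floods meet along the bright walls,
# watershed lines form, and those lines are the cell boundaries.
labels = segmentation.watershed(smooth, markers)

# Report each cell's area, mirroring the measure-and-count step.
for region in measure.regionprops(labels):
    print(f"cell {region.label}: area = {region.area} pixels")
```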
Inspired by other crowdsourcing successes, Cunha, with the help of grad students from the lab of his colleague Tsang Ing Ren, created an interactive web tool called Collaborative Segmentation—or CoSe—so that schoolchildren could help researchers with the outlining while also learning about plant biology and generating data that could eventually help train a new computerized outlining tool.
The students, including fourth- and fifth-graders from Hamilton Elementary School in Pasadena and high school students from Orthopaedic Hospital Medical Magnet High School in Los Angeles, were trained to outline, on a computer, the boundaries of cells in images of plant tissue. They had minimal information about the nature of the images and were simply told to trace the contours as they appeared on the screen; they were nonexperts who nonetheless collectively produced remarkable results.
When the combined results from the students were compared to Roeder’s earlier traced outlines, Cunha saw that the composite of the crowd’s tracings closely matched those of a trained expert like Roeder. Cunha says this success opened the possibility of delegating outlining projects like these to a larger crowd anywhere in the world—allowing researchers to collect results in a fraction of the time it would take a single individual. “Crowdsourcing allowed us to develop an interactive and far-reaching tool that can save researchers lots of time in the lab,” Cunha says. “In this case, the crowd may still miss some of the difficult outlines, but a researcher could always go back and correct them. In the end, correcting a few outlines is much faster than tracing every single cell by hand.”
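A minimal sketch of that compositing idea, assuming each student’s tracing is a binary mask: stack the masks and keep the pixels that a majority marked. The simulated tracings below stand in for real student data.

```python
import numpy as np

# Simulate many imperfect tracings of the same "true" outline region,
# then combine them by pixel-wise majority vote. All data is synthetic.
rng = np.random.default_rng(0)
truth = np.zeros((64, 64), dtype=bool)
truth[20:40, 20:40] = True  # a stand-in "true" traced region

# 15 noisy student tracings: flip 5% of pixels at random in each.
tracings = np.array(
    [truth ^ (rng.random(truth.shape) < 0.05) for _ in range(15)]
)

# Majority vote: a pixel joins the composite if >50% of tracings mark it.
composite = tracings.mean(axis=0) > 0.5

agreement = (composite == truth).mean()
print(f"composite matches ground truth on {agreement:.1%} of pixels")
```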
Flock Focus
Plant cells, of course, aren’t the only things that computers can be trained to “see.” In fact, Pietro Perona has been working on computer learning based on the human visual recognition system for more than 20 years. People can seamlessly assign names to images; when we see a familiar object, we can quickly recognize that object and say what it is. This process of translating an image into text is a problem for computers, however, so Perona wanted to find a solution. To do this, Perona and his collaborators—including postdoctoral scholar Steve Branson, graduate student Grant Van Horn, and Cornell Tech professor Serge Belongie (BS ’95)—designed a crowdsourcing project in which computers could learn from the way humans process visual information.
The ultimate goal of the project, called Visipedia, is to be able to harvest, organize, and make available visual expertise and visual knowledge on any topic. Because bird-watching is a popular hobby with a dedicated and enthusiastic following, the researchers began their project by collaborating with the Cornell Lab of Ornithology to create a version of Visipedia that acts as a sort of image-driven field guide for bird-watching. In order to train Visipedia software to identify the approximately 1,000 species of birds in North America from images alone, Perona calculated that he would first need humans to label the species in more than 300,000 bird photos—representing a variety of poses and lighting situations for each species. To reach this number, he used a paid crowdsourcing service called Amazon Mechanical Turk, where workers are paid a very small amount to complete simple human intelligence tasks—or HITs. In this case, they were asked to identify the birds in Perona’s photos.
Although the project did depend on the majority of participants correctly identifying the birds in the images, Perona also gathered interesting information from wrong answers. “When people make mistakes, they have very different styles of making mistakes,” he explains. “Using modern statistical techniques, we can analyze the patterns of responses that people give, and just by looking at the pattern of the answers you can get into the heads of these people and figure out what they’re thinking as they’re annotating these images. And that helps you develop a computer program that can label the pictures better.”
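The statistical machinery Perona alludes to can be sketched in miniature. A classic approach of this kind is the Dawid-Skene algorithm, which models each annotator with a confusion matrix and estimates those matrices jointly with the true labels via expectation-maximization; the article doesn’t specify Perona’s exact method, and the toy votes below are invented.

```python
import numpy as np

# Dawid-Skene-style sketch: infer true labels and per-worker error
# patterns from raw votes. votes[i, j] = label worker j gave item i.
votes = np.array([
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
])
n_items, n_workers = votes.shape
K = 2  # number of possible labels

# Start from per-item vote fractions as a soft guess at the true labels.
post = np.stack([(votes == k).mean(axis=1) for k in range(K)], axis=1)

for _ in range(20):  # EM iterations
    # M-step: class priors and each worker's confusion matrix,
    # conf[j, t, g] = P(worker j answers g | true label is t).
    prior = post.mean(axis=0)
    conf = np.full((n_workers, K, K), 1e-6)
    for j in range(n_workers):
        for g in range(K):
            conf[j, :, g] += post[votes[:, j] == g].sum(axis=0)
    conf /= conf.sum(axis=2, keepdims=True)

    # E-step: posterior over each item's true label given all votes.
    logp = np.log(prior) + np.zeros((n_items, K))
    for j in range(n_workers):
        logp += np.log(conf[j, :, votes[:, j]].T)
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

print("inferred labels:", post.argmax(axis=1))
```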
After five years of learning from the crowdsourced human responses, Visipedia can now identify more than 500 species of North American birds simply from uploaded photos. (However, it helps if the user can tell the program where on the continent the photo was taken and also pinpoint certain features—like the bird’s beak and feet—so that the computer can understand the orientation of the image.) Visipedia then processes the visual and geographic information to provide the name of the species in the photo. On-the-fly bird identification is just one application of the technology, Perona says. For example, it could also be used for e-commerce. When an online shopper chooses, say, a chair that he likes, a visual-based recommendation system might one day be able to automatically group together all chairs that have the same shape. Or a medical version of Visipedia could use photographs of patients’ symptoms to help nurses diagnose diseases in resource-limited settings.
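As a rough illustration of the identification step described above, here is a hedged sketch of combining a classifier’s visual scores with a geographic prior in a Bayes-style product; all species names, scores, and range probabilities are invented, and Visipedia’s actual pipeline is more sophisticated.

```python
import numpy as np

# Combine hypothetical visual evidence with a hypothetical location prior.
species = ["song sparrow", "house finch", "cactus wren"]

# Invented softmax scores from an image classifier for one uploaded photo.
image_scores = np.array([0.45, 0.40, 0.15])

# Invented prior: how common each species is where the photo was taken.
geo_prior = np.array([0.50, 0.49, 0.01])  # cactus wren rare at this spot

# Bayes-style combination: rescale visual evidence by the location prior,
# then renormalize so the scores sum to one.
combined = image_scores * geo_prior
combined /= combined.sum()

best = combined.argmax()
print(f"best guess: {species[best]} (p = {combined[best]:.2f})")
```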
A Coral Crew
After Perona’s success with crowdsourcing human vision through Amazon Mechanical Turk, chemical oceanographer Jess Adkins thought that he too might have a good project for a crowd to take on. Adkins and his team study the history of global climate change via ancient fossilized corals deep beneath the waves, on the ocean floor. Global changes in climate have a direct impact on deep ocean circulation patterns, and the varying ages and chemical makeups of these corals can provide clues about past oceanic behavior—and thus, shifts in climate—over many different periods of the past 100,000 years. “However, we can’t do any of that if we can’t find the corals,” Adkins says. “And it turns out that it’s pretty difficult to just blindly try to find corals lying in the sediment a mile below your boat.”
So over the years Adkins and his colleagues have used manned and unmanned underwater vehicles to take thousands and thousands of detailed photographs of specific regions of the ocean floor to try to pinpoint key areas of coral clusters. “In the past, I’ve tried to have experts examine these photos to look for corals, but you can never get through all of them. And often these scientists spend way too long looking at each picture,” he says. “They’ll look at a photo and say, ‘Oh what’s that interesting thing over there? And what is that?’ And I want to tear my hair out after a while because I just need to know one thing: Are there corals in the picture or not?”
After talking to Perona about his experiences with crowdsourcing, Adkins also turned to Amazon’s Mechanical Turk. He found that the participants—or “turkers”—had goals directly aligned with his own: because they are paid according to the number of photos they score in a certain amount of time, the turkers would look for corals in each photo and nothing else. In fact, after the researchers uploaded the photos and provided a few simple instructions, the turkers were able to view and score the first 10,000 images in just 36 hours. The specific corals the turkers were asked to spot are sessile filter feeders, meaning they are fixed to the ocean floor and rely on currents to bring their food. By comparing the photographs to detailed topographical maps of the area in which the photos were taken, Adkins and his team can determine what topographical characteristics seem to be the best at funneling food to the corals. The researchers’ end goal is to figure out, from a formula of slopes, elevations, and other factors, the percent likelihood of finding corals in an area.
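One simple way to turn such terrain features into a percent likelihood is logistic regression, sketched below with invented slope and elevation values; the article doesn’t say which model Adkins’s team uses, so treat this as an illustrative stand-in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a toy model mapping seafloor terrain features to the probability
# that turkers scored corals there. All values below are invented.
# columns: [slope (degrees), elevation above surrounding floor (m)]
X = np.array([
    [25.0, 40.0], [30.0, 55.0], [5.0, 2.0], [2.0, 1.0],
    [20.0, 35.0], [1.0, 0.5], [28.0, 60.0], [4.0, 3.0],
])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # 1 = corals seen in the photo

model = LogisticRegression().fit(X, y)

# Percent likelihood of corals at a new, unsurveyed spot.
new_site = np.array([[22.0, 45.0]])
p = model.predict_proba(new_site)[0, 1]
print(f"estimated likelihood of corals: {p:.0%}")
```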
Sending underwater vehicles down to collect corals is an expensive endeavor, Adkins says. “And with a percent likelihood map, we could remotely sense the topography of the seafloor anywhere and very quickly know where the coral hotspots are. This will be helpful as we decide how to wisely use these precious assets.” And, ultimately, the more corals he finds, the more he can learn about our planet’s response to past changes in climate.
Swarming Space
Just as Adkins uses crowdsourcing to survey the vast deep ocean while sitting on a boat atop the waves, researchers who study our galaxy using the Spitzer Space Telescope have also been drawn to crowdsourcing to address a similar challenge: How do you study an entire galaxy if you’re always sitting right inside of it? One big development came in 2008 when a team of scientists revealed that they had created the largest and most detailed portrait of our galaxy by essentially piecing together more than 800,000 photos taken from within the Milky Way by Spitzer.
Although this large composite image allowed scientists to answer some big-picture questions about the Milky Way’s structure—for example, they determined that our galaxy has only two spiral arms rather than four, as was previously thought—the researchers knew that there was almost no way they’d be able to closely examine the nearly 1 million photos that were being curated by the Infrared Science Archive, part of the Infrared Processing and Analysis Center (IPAC) at Caltech. “The amount of our galaxy that we’ve imaged so far is so large that if we printed it in full resolution, it would circle the Rose Bowl,” says Luisa Rebull, a staff scientist at IPAC. “It’s an enormous amount of data, and there are only about 8,000 professional astronomers in the U.S. We don’t have time to go through all of this data by hand, but with the broad use of the Internet in schools and in people’s homes there are now a lot of ways for people from the general public to get their hands on data.”
Based on this concept, a multi-university team of researchers created the Milky Way Project in 2010. The project, which is hosted on the Zooniverse citizen science web portal, allows the general public to analyze approximately 440,000 of the images from Spitzer that are archived at IPAC. When users sign up to participate in the Milky Way Project, a tool first shows them examples of what they’ll be looking for: star clusters, which are groups of stars that have been pulled together by gravity; bubbles, thought to be regions of early star formation; and extended green objects, which appear often in the Spitzer images but are still a bit of a mystery to astronomers.
Participants then view individual images from the Spitzer survey and circle and identify the different types of features they see. “The idea is that you train your eye. If you’re a novice, maybe you’re worried that the first 20 or so you looked at, you didn’t get them right. But each image is being looked at by at least 20 other people,” Rebull says. “So even if one individual gives the wrong answer, the consensus of everyone who looks at the image will be correct. And if there are objects that almost no one can agree on, then we know that the scientists will need to look at those objects in detail.”
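Rebull’s consensus rule is easy to sketch: take the majority answer among an image’s roughly 20 classifications, and flag low-agreement images for expert review. The toy answers and the 60 percent cutoff below are invented.

```python
from collections import Counter

# 20 hypothetical classifications of one Spitzer image from 20 volunteers.
answers = ["bubble"] * 14 + ["star cluster"] * 4 + ["green object"] * 2

counts = Counter(answers)
label, n = counts.most_common(1)[0]  # most popular answer and its count
agreement = n / len(answers)

if agreement >= 0.6:  # illustrative cutoff for accepting the consensus
    print(f"consensus: {label} ({agreement:.0%} agreement)")
else:
    print("no consensus: route this image to a professional astronomer")
```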
So far, this strategy has worked well for the Milky Way Project. Over 900,000 citizen scientists have identified nearly 1.5 million objects in the Spitzer images—allowing professional astronomers to spend their time studying the features that matter most for their particular research. Follow-up studies have found that these results from the crowd are quite reliable, and further analyses of the observations have actually been published in four papers that provide new information about star formation in our galaxy. In fact, a paper published in January 2015 revealed that the project’s participants had uncovered a class of previously unrecognized features—dubbed “yellow balls” for their appearance in the infrared Spitzer images—that may provide a new way to detect the early stages of massive star formation. The kind of information that projects such as the Milky Way Project provide can enable professional researchers to more quickly find the information they need in order to make discoveries, Rebull says. And, she adds, thanks to crowdsourcing, “these projects have also allowed thousands of people to learn a bit more about the world—and even the galaxy—in which they live.”
Harry Gray is the Arnold O. Beckman Professor of Chemistry and the founding director of the Beckman Institute. CCI Solar is a program of the National Science Foundation.
Alexandre Cunha is a computational scientist at the Center for Data-Driven Discovery. His crowdsourcing work with Adrienne Roeder and Elliot Meyerowitz, George W. Beadle Professor of Biology and Howard Hughes Medical Institute Investigator, was funded by the Gordon and Betty Moore Foundation, the U.S. Department of Energy, and the Biological Network Modeling Center of the Beckman Institute. The Collaborative Segmentation project is an ongoing collaboration with Tsang Ing Ren from UFPE, Brazil.
Pietro Perona is the Allen E. Puckett Professor of Electrical Engineering. His work on Visipedia is funded by Caltech and the Office of Naval Research Multidisciplinary University Research Initiatives Program.
Jess Adkins is a professor of geochemistry and global environmental science. His work using crowdsourcing to map the ocean floor is funded by the National Science Foundation and Caltech’s Davidow Discovery Funds.
Luisa Rebull is a staff scientist and member of the professional staff at the Spitzer Science Center and the Infrared Science Archive at IPAC. She is also the director of the NASA/IPAC Teacher Archive Research Program. The Milky Way Project is a collaboration between the University of Oxford, the Adler Planetarium, and the Spitzer Space Telescope.