Annual Report [2018]

SFARI’s Data Infrastructure for Autism Discovery

Autism Research (SFARI)

About 15 years ago, a consensus began emerging that autism is not a single condition but rather a diverse one with hundreds of different subtypes and underlying genes. Given this complexity, the Simons Foundation Autism Research Initiative (SFARI) concluded that the traditional approach to scientific discovery — in which each laboratory collects its own datasets and keeps them close to its vest — would not have sufficient power to map autism. What would be required instead were vast, shared datasets that would allow many different research groups to help fill in the picture of autism.

SFARI took on the task of creating such datasets. Crucially, it made an early decision to administer the datasets itself rather than depend on a governmental entity or an external group of investigators. Over the years, this direct stewardship has allowed SFARI to ensure that the datasets uphold the highest standards of quality and privacy, while simultaneously remaining flexible enough to meet the ever-expanding needs of autism researchers.

Today, a dedicated informatics group within the foundation distributes data from four different autism-related cohorts spanning thousands of families. The Simons Simplex Collection (SSC) is an assemblage of genetic and phenotypic data from more than 2,600 ‘simplex’ families that have one affected child along with unaffected parents and siblings. The Simons Variation in Individuals Project (Simons VIP), recently renamed Simons Searchlight, collects phenotypic data and biological samples from individuals with a mutation in one of more than 50 different autism-linked genes. The Autism Inpatient Collection (the only one of the four datasets that SFARI does not directly manage) is a cohort of individuals whose autism is severe enough to require long hospitalizations. Finally, SFARI’s most ambitious project yet, Simons Foundation Powering Autism Research for Knowledge (SPARK), aims to collect genotypic and phenotypic data from 50,000 families.

These datasets have given rise to more than 200 published papers about autism, a figure that is “a testament to the success of these cohorts,” says Stephan Sanders, a geneticist at the University of California, San Francisco. “They have had a really massive impact on the field.”

A visualization of nine genes (DPP6, ITSN1, BRSK2, etc.) recently discovered to be linked to autism spectrum disorder based on pilot data from SPARK. Genes are colored based on their function, such as potassium ion transport, cell movement and steroid-mediated signaling, and are bundled according to associations between the genes. Credit: P. Feliciano et al., 2019

The datasets not only provide resources for researchers already studying autism but also lure new researchers into the field. “It’s like having very nice flowers for the bees,” says Wendy Chung, SPARK’s principal investigator and SFARI’s director of clinical research.

Over the years, the datasets have launched a new generation of autism researchers. “My research career has been made on the back of the SSC,” Sanders says. “When I look at the papers I’ve published, all the ones with the biggest impact have been as a result of the SSC.”

Mining for research gold

Researchers can request access to the datasets through an online portal called SFARI Base. Once their requests are approved, the Simons Foundation informatics group stands ready to help them get what they need from nearly a petabyte of data. Researchers can download data to their desktop computers or run computations in the cloud that don’t require massive downloads. 

SFARI has invested in building a variety of online data-visualization tools to help scientists mine the data for as much research gold as possible. The Genotypes and Phenotypes in Families tool, developed by SFARI Investigator Ivan Iossifov of Cold Spring Harbor Laboratory and his collaborators, helps users search for gene variants and explore behavioral data and medical histories of participants in the SSC, Simons Searchlight and SPARK. SFARI Viewer — developed by the company Frameshift Genomics in collaboration with SFARI’s informatics group and a team led by SFARI Investigator Gabor Marth of the University of Utah — allows researchers to interact dynamically with SSC and SPARK data, filtering them according to a wide array of options. Additionally, the cloud-based WuXi NextCODE SSC portal (which was not directly funded by SFARI) offers yet another way for researchers to visualize and analyze SSC data.

The informatics team’s role extends far beyond helping researchers access and analyze the datasets. The team is involved in nearly every stage of the process of preparing and distributing the datasets, from filtering out errors to predicting the impact of genetic variants to integrating genomic and phenotypic data. “Many people participate in making sure the datasets are the highest quality we can reasonably make them before they are distributed to the research community,” says Alex Lash, the foundation’s chief informatics officer.

Once a research group has performed a study using SFARI data, the informatics team works to integrate the group’s discoveries into the existing datasets, continually enriching each dataset with new information. This process means that even the SSC, which stopped enrolling new families years ago, remains “a gift that keeps giving,” says Marta Benedetti, a senior scientist at SFARI. “There’s so much that has come out and so much that can still be mined from this dataset.”

Although the identities of individuals in the datasets are kept strictly private in all cases, SPARK participants who agree to it may be recontacted by researchers about follow-up studies for which they are eligible. So far, “the response from the cohort has been fantastic,” Benedetti says.

Recently, this SPARK ‘research match’ program enabled a team led by Jacob Michaelson, an autism researcher at the University of Iowa, to contact about 5,000 families for a survey on common problems in autism, such as sleep disruptions, eating disorders and gastrointestinal problems.

“Working on our own, I’d probably be at the end of my career before I’d be able to collect this much data,” Michaelson says. The research match program is a “game-changer,” he says. “It’s the kind of infrastructure no lab could hope to have on its own.”

SFARI owes a special debt of gratitude to the families who have allowed their data to be used, says Casey White-Lehman, a supervisor and senior project manager for several SFARI cohorts. “Without them, we wouldn’t have any of these cohorts,” she says. “Some have been engaged with us for a decade, and we’re humbled and grateful that they’re willing to share their time with us.”