Relational database management systems
Examples include Postgres, Oracle and Snowflake. PostgreSQL, also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. PostgreSQL features transactions with atomicity, consistency, isolation, durability (ACID) properties, automatically updatable views, materialized views, triggers, foreign keys, and stored procedures. It is designed to handle a range of workloads, from single machines to data warehouses or web services with many concurrent users. It was the default database for macOS Server and is also available for Linux, FreeBSD, OpenBSD, and Windows.
Examples include Vertica. A column-oriented DBMS or columnar DBMS is a database management system (DBMS) that stores data tables by column rather than by row. Benefits include more efficient access to data when querying only a subset of columns (by eliminating the need to read columns that are not relevant) and more options for data compression. However, column stores are typically less efficient for inserting new data. Practical use of a column store versus a row store differs little in the relational DBMS world. Both columnar and row databases can use traditional database query languages like SQL to load data and perform queries. Both row and columnar databases can become the backbone in a system to serve data for common extract, transform, load (ETL) tools.
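The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration using plain lists and dicts, not how any particular DBMS stores data; the table contents, column names and the helpers `to_columns` and `run_length_encode` are hypothetical.

```python
from itertools import groupby

# Row-oriented layout: each record is stored together.
rows = [
    {"sample": "s1", "chrom": "chr1", "depth": 30},
    {"sample": "s2", "chrom": "chr1", "depth": 28},
    {"sample": "s3", "chrom": "chr2", "depth": 35},
]

def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Compress runs of repeated values into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]

columns = to_columns(rows)

# A query on one column reads only that column's values.
mean_depth = sum(columns["depth"]) / len(columns["depth"])

# Repetitive columns compress well once their values sit side by side.
encoded_chrom = run_length_encode(columns["chrom"])
```

Appending a new record to `rows` is a single operation, while the columnar form needs one append per column, which hints at why inserts tend to cost more in a column store.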
SciDB is a column-oriented database management system (DBMS) designed for multidimensional data management and analytics common to scientific, geospatial, financial, and industrial applications. It is developed by Paradigm4 and co-created by Michael Stonebraker.
Common workflow language (CWL)
Generic – The Common Workflow Language (CWL) is one “standard” for describing computational data-analysis workflows. Development of CWL is focused particularly on serving the data-intensive sciences, such as bioinformatics, medical imaging, astronomy, physics, and chemistry. A key goal of CWL is to allow the creation of a workflow that is portable and thus may be run reproducibly in different computational environments. Running previously developed pipelines in CWL requires command-line and higher-level language proficiency as well as pipeline architecture skills.
Examples of CWL: DNAnexus’s UK Biobank RAP. The cost and difficulty of using RAP are well documented on GitHub.
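To give a sense of what a CWL description looks like, here is a minimal sketch that builds a one-step tool description in Python and serializes it as JSON (CWL documents are usually written in YAML, of which JSON is a subset). The wrapped command (`echo`) and the input name `message` are illustrative assumptions and are not taken from any pipeline mentioned here.

```python
import json

def make_echo_tool():
    """Return a minimal CWL CommandLineTool description wrapping `echo`."""
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        "inputs": {
            "message": {
                "type": "string",
                # Place the value as the first positional argument.
                "inputBinding": {"position": 1},
            }
        },
        "outputs": [],
    }

tool = make_echo_tool()
cwl_text = json.dumps(tool, indent=2)
print(cwl_text)
```

Saved to a file, a description like this could be handed to a CWL runner such as `cwltool`. Real analysis pipelines chain many such steps, which is where the architecture skills mentioned above come in.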
GUIs – cohort selectors, PheGe for summary statistics
APIs – R, Python with over 300 endpoints and bring your own analytics
flexFS – POSIX compliant filestore, designed for high throughput computation with 100s of nodes
BurstMode – containerized elastic compute for the spot market
SciDB and other structured databases
HDF5 and tileDB – a file structure that includes only two types of object:
Image files – REVEAL is compatible with all files.
Wearables – REVEAL is compatible with all
Next generation sequencing (NGS)
Single Cell Analysis
Proteomics analysis by mass spectrometry
High throughput proteomics – Olink, Somalogic…
Flow cytometry – Becton Dickinson, Beckman Coulter, ThermoFisher…
Metabolomics by mass spectrometry
Common datasets integrated into REVEAL allow rapid, in-database joins to functionally annotate variants, genes, transcripts, proteins and metabolites. These ontologies allow REVEAL to seamlessly annotate results from data selections, calculations like regressions or deep learning using joins or intersections. This is a capability not found in HDF5 or tileDB.
dbNSFP v4 – a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs
dbSNP – The Single Nucleotide Polymorphism Database is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information in collaboration with the National Human Genome Research Institute.
OMIM – Online Mendelian Inheritance in Man is a continuously updated catalog of human genes and genetic disorders and traits, with a particular focus on the gene-phenotype relationship.
DOID – The Disease Ontology has been developed as a standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts.
UBERON – Uberon is an integrated cross-species anatomy ontology representing a variety of entities classified according to traditional anatomical criteria such as structure, function and developmental lineage. The ontology includes comprehensive relationships to taxon-specific anatomical ontologies, allowing integration of functional, phenotype and expression data.
UNIPROT – a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.
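The idea of annotating results by joining them to a reference dataset can be sketched as follows. This uses SQLite from the Python standard library as a stand-in, not REVEAL's own API; the table names, the identifiers (`var_A`, `GENE1`, ...) and the p-values are made up for illustration.

```python
import sqlite3

def annotate_results(results, annotations):
    """Join analysis results to an annotation table on a shared variant key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (variant_id TEXT, p_value REAL)")
    conn.execute("CREATE TABLE annotation (variant_id TEXT, gene TEXT)")
    conn.executemany("INSERT INTO results VALUES (?, ?)", results)
    conn.executemany("INSERT INTO annotation VALUES (?, ?)", annotations)
    rows = conn.execute(
        """
        SELECT r.variant_id, r.p_value, a.gene
        FROM results AS r
        JOIN annotation AS a ON a.variant_id = r.variant_id
        ORDER BY r.p_value
        """
    ).fetchall()
    conn.close()
    return rows

# Hypothetical regression output and a hypothetical annotation source.
results = [("var_A", 1e-8), ("var_B", 0.03), ("var_C", 0.5)]
annotations = [("var_A", "GENE1"), ("var_B", "GENE2")]
annotated = annotate_results(results, annotations)
```

Because this is an inner join, `var_C` drops out: it has no entry in the annotation table. A left join would keep it with an empty gene field instead.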
Biochemical databases provide further annotation for results, enabling visualization in common pathway and graph visualization tools by joining the results with these data using the Python or R APIs in REVEAL.
Protein–protein interaction databases – including DIP, Biomolecular Interaction Network Database (BIND), Biological General Repository for Interaction Datasets (BioGRID), Human Protein Reference Database (HPRD), IntAct Molecular Interaction Database, Molecular Interactions Database (MINT), MIPS Protein Interaction Resource on Yeast (MIPS-MPact), and MIPS Mammalian Protein–Protein Interaction Database (MIPS-MPPI)
Human Protein Interactome Project – Pairwise combinations of human protein-coding genes are tested systematically using high throughput yeast two-hybrid screens to detect protein-protein interactions. The quality of these interactions is further validated in multiple orthogonal assays. Currently 64006 PPIs involving 9094 proteins have been identified using this framework.
In addition to systematically identifying PPIs experimentally, this web portal also includes PPIs of comparable high quality extracted from literature. This subset of literature-curated PPIs currently comprises 13441 PPIs involving 6047 proteins.
DrugBank – The DrugBank database is a comprehensive, freely accessible, online database containing information on drugs and drug targets created and maintained by the University of Alberta and The Metabolomics Innovation Centre located in Alberta, Canada.
KEGG Pathway – a collection of manually drawn pathway maps representing our knowledge of the molecular interaction, reaction and relation networks
Large public datasets are challenging for most platforms to provide in an easily sliceable format. Typical strategies involve repeated extraction, transformation and loading. REVEAL takes an ETL-once approach to managing these datasets, which can be loaded directly into REVEAL.
NIH – GEO, TOPMed datasets
Broad Institute datasets – GTEx, gnomAD datasets
EBI datasets – Human Cell Atlas
REVEAL: Reference is in active development. The goal is to provide an annotation and harmonization environment that works seamlessly with analysis workflows, providing a dynamic, knowledge-graph-like capability based on the most current versions of public annotation datasets.
OMOP to ICD – can be implemented if the customer supplies the mapping data
Ensembl to Uniprot – implemented in REVEAL: Reference
We will begin to integrate machine learning models.
REVEAL supports ML and AI in two different ways.
flexFS supports machine learning and AI on GPU and CPU instances in the cloud as the best price/performance file server in AWS, scaling seamlessly to 1000 nodes.
flexFS and BurstMode support a Singularity-based container for deep learning, transfer learning, TensorFlow and other tools. The Singularity container retains the permissions during execution.
Singularity – a widely adopted container runtime that implements a unique security model to mitigate privilege escalation risks and provides a platform to capture a complete application environment into a single file (SIF)
Re-dimension – casting the data along dimensional axes. This is typically done during the development of a schema or a particular array, based on access patterns to the data. REVEAL schemas have been rigorously tested for performance in data joining, aggregation, filtering and calculations. Re-dimensioning individual, tile-sized datasets is very fast. Re-dimensioning TB-sized datasets takes time because the data must be rewritten. Beware of canned “demos”.
Joining in database – Commonly referred to as intersections in R and Python. Selecting data with shared dimensional coordinates or attributes. Because SciDB is columnar in nature, joins along dimensions are virtually instantaneous. Joining allows users to flexibly associate versioned names/ontologies, data across modalities, and myriad other intersections.
Aggregation in database – aggregations are essential for in-database mathematics such as statistical calculations, linear algebra (like PCA) and myriad other applications.
Filtering in database – subsetting data. This is the only functionality for data selection in tileDB.
Research reproducibility is the primary purpose of in-database functions.
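The four operations above can be illustrated with a toy sparse array held in a Python dict keyed by coordinates. This is a conceptual sketch only: it is not SciDB's query language or operators, and the function names, the (sample, gene) dimensions and the values are all hypothetical.

```python
from collections import defaultdict

# Toy sparse arrays keyed by (sample, gene) coordinates.
expression = {("s1", "g1"): 5.0, ("s1", "g2"): 1.0, ("s2", "g1"): 7.0}
protein = {("s1", "g1"): 0.4, ("s2", "g1"): 0.9, ("s2", "g2"): 0.2}

def redimension(array, order):
    """Re-key cells so the coordinate axes appear in a new order."""
    return {tuple(coords[i] for i in order): value
            for coords, value in array.items()}

def join(left, right):
    """Keep cells whose coordinates occur in both arrays, pairing their values."""
    return {c: (left[c], right[c]) for c in left.keys() & right.keys()}

def aggregate_mean(array, axis):
    """Average cell values grouped by the coordinate at position `axis`."""
    groups = defaultdict(list)
    for coords, value in array.items():
        groups[coords[axis]].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

def filter_cells(array, predicate):
    """Return the subset of cells whose value satisfies `predicate`."""
    return {c: v for c, v in array.items() if predicate(v)}

by_gene = redimension(expression, order=(1, 0))      # (gene, sample) keys
paired = join(expression, protein)                   # shared coordinates only
mean_per_gene = aggregate_mean(expression, axis=1)   # group by gene
high = filter_cells(expression, lambda v: v > 4.0)   # simple subsetting
```

Note that `redimension` rewrites every cell, which is cheap for a small array and costly for a very large one, in line with the caveat on TB-sized datasets above.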
In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. In the context of databases, a sequence of database operations that satisfies the ACID properties (which can be perceived as a single logical operation on the data) is called a transaction. For example, a transfer of a value from one dataset to another, even involving multiple changes such as subtracting from one dataset and adding to another, is a single transaction.
Atomicity – Transactions are often composed of multiple statements. Atomicity guarantees that each transaction is treated as a single “unit”, which either succeeds completely or fails completely: if any of the statements constituting a transaction fails to complete, the entire transaction fails and the database is left unchanged.
Consistency – Consistency ensures that a transaction can only bring the database from one consistent state to another, preserving database invariants: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This prevents database corruption by an illegal transaction. Some systems, like tileDB, use the term “eventually compliant”, which is ambiguous, especially in the context of high-speed, parallel transactions.
Isolation – Transactions are often executed concurrently (e.g., multiple transactions reading and writing to a table at the same time). Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.
Durability – Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). This usually means that completed transactions (or their effects) are recorded in non-volatile memory.
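The transfer example can be sketched with SQLite from the Python standard library, used here only as a convenient ACID-compliant stand-in. The `accounts` table, the account names and the helper functions are hypothetical. The credit is deliberately applied before the debit so that a failed debit leaves a partial change for the rollback to undo.

```python
import sqlite3

def open_ledger(balances):
    """Create an in-memory ledger whose balances may never go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:
        conn.executemany("INSERT INTO accounts VALUES (?, ?)", balances.items())
    return conn

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction.

    `with conn` commits if the block succeeds and rolls back if it raises,
    so both updates are applied or neither is.
    """
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )

def read_balances(conn):
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

ledger = open_ledger({"alice": 100, "bob": 50})
transfer(ledger, "alice", "bob", 30)       # succeeds and is committed
try:
    transfer(ledger, "alice", "bob", 500)  # debit violates the CHECK rule
except sqlite3.IntegrityError:
    pass                                   # the credit to bob was rolled back
final = read_balances(ledger)
```

The `CHECK` constraint plays the role of a consistency rule, and the rollback on failure shows atomicity. Isolation and durability are not exercised by a single in-memory connection.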
The Portable Operating System Interface is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems. POSIX defines both the system and user-level application programming interfaces (APIs), along with command line shells and utility interfaces, for software compatibility (portability) with variants of Unix and other operating systems.
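As a small illustration of the system-level interface, Python's `os` module exposes thin wrappers around POSIX file calls such as `open`, `write`, `fsync`, `lseek`, `read` and `close`. The sketch below is an assumption-laden example (the helper name and the file name `example.bin` are hypothetical); code written against these calls should behave the same on any POSIX-compliant file system.

```python
import os
import tempfile

def write_then_read(path, payload):
    """Write bytes and read them back using descriptor-level calls."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)          # write through the file descriptor
        os.fsync(fd)                   # ask the OS to flush data to storage
        os.lseek(fd, 0, os.SEEK_SET)   # rewind to the start of the file
        return os.read(fd, len(payload))
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as workdir:
    data = write_then_read(os.path.join(workdir, "example.bin"), b"posix")
```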
Biomarker definitions and capabilities in REVEAL
Diagnostic – for patient selection. Data stored in REVEAL can quickly be tested for correlations between RWD and molecular data to identify omics signatures that predict disease. Numerous papers from Alnylam and posters from Paradigm4 illustrate using REVEAL to identify diagnostic biomarkers.
Monitoring – to predict the course of a disease, indicate the presence of toxicity, or provide evidence of exposure. Data stored in REVEAL can be used to compare time-based results and identify molecular or RWD signatures, using simple correlations, machine learning or deep learning, to derive and test monitoring biomarkers.
Predictive biomarkers – identify individuals likely to experience an effect from a specific intervention or exposure. REVEAL was shown to enable this with the MIMIC dataset in a 2018 publication.
Prognostic – stratify patients, enrichment: exclusion/inclusion data. This requires a robust measurement, implying a large and sparse dataset. Identifying prognostic biomarkers has been demonstrated with population scale genomics datasets stored in REVEAL.
Safety – indicate the presence or extent of toxicity related to exposure or an intervention
Susceptibility/Risk – indicate potential for developing a disease or sensitivity to a treatment
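The "simple correlations" mentioned for diagnostic and monitoring biomarkers can be as basic as a Pearson coefficient between a molecular measurement and a clinical variable. The sketch below is generic Python, not REVEAL's API, and the two value lists are made up purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only.
marker_level = [1.0, 2.0, 3.0, 4.0]
clinical_score = [2.0, 4.0, 6.0, 8.0]
r = pearson(marker_level, clinical_score)
```

A coefficient near 1 or -1 would only flag a candidate signature. Deriving and testing an actual biomarker needs far larger data and proper statistical validation.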