The Schema¶
Protocol buffers¶
Protocol buffers offer a simple way to define a schema for structured data. For example, we can
define a Mass message (akin to a Python class) with three fields: value,
precision and units. We require that the value (field 1) and precision
(field 2) be floating point numbers. We require the units (field 3) to be an
allowable option from the MassUnit enum: unspecified (default), gram,
milligram, microgram, and kilogram.
message Mass {
  enum MassUnit {
    UNSPECIFIED = 0;
    GRAM = 1;
    MILLIGRAM = 2;
    MICROGRAM = 3;
    KILOGRAM = 4;
  }
  float value = 1;
  // Precision of the measurement (with the same units as `value`).
  float precision = 2;
  MassUnit units = 3;
}
“Protos” are messages with defined values (akin to an instance of a Python class). They can be read from and written to JSON, Protobuf text (pbtxt), and Protobuf binary formats.
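To make the class analogy concrete, here is a rough pure-Python sketch of what the Mass message describes. It is not the code generated by the protocol buffer compiler. The dataclass and its defaults are assumptions for illustration, chosen to mirror the behavior where an unset number reads as 0 and an unset enum reads as its zero-valued option.

```python
import dataclasses
import enum


class MassUnit(enum.IntEnum):
    """Mirrors the MassUnit enum; the numbers match the message definition."""
    UNSPECIFIED = 0
    GRAM = 1
    MILLIGRAM = 2
    MICROGRAM = 3
    KILOGRAM = 4


@dataclasses.dataclass
class Mass:
    """Illustrative stand-in for the Mass message (not the generated class)."""
    value: float = 0.0
    precision: float = 0.0  # Same units as `value`.
    units: MassUnit = MassUnit.UNSPECIFIED


# A "proto" is then analogous to an instance with concrete values.
mass = Mass(value=1.25, units=MassUnit.GRAM)
```

Fields that are never assigned keep their defaults, which is why a validation layer is needed to tell "unset" apart from "deliberately zero".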
The Reaction and Dataset messages¶
A single-step reaction in the ORD is defined by a Reaction message. A
collection of reactions can be aggregated into a Dataset message that
includes a description of the dataset and examples of its use in downstream
applications.
message Dataset {
  string name = 1;
  string description = 2;
  // The Dataset is specified by either:
  //   * a list of Reactions
  //   * a list of Reaction IDs from existing datasets
  // Note that these are mutually exclusive.
  //
  // List of Reaction messages that are part of this dataset.
  repeated Reaction reactions = 3;
  // List of Reaction IDs that are part of this dataset. This is designed for
  // creating Datasets that are composed of subsets of Reactions from existing
  // datasets. For example, a collection of all reactions of a certain type
  // across multiple datasets.
  repeated string reaction_ids = 4;
  // Examples of how to use the Dataset, e.g. in downstream applications.
  repeated DatasetExample examples = 5;
  // Dataset ID assigned during the submission process.
  string dataset_id = 6;
}
Reaction messages contain ten fields:
message Reaction {
  repeated ReactionIdentifier identifiers = 1;
  // List of pure substances or mixtures that were added to the reaction vessel.
  // This is a map instead of a repeated field to simplify reaction templating
  // through the use of keys. String keys are simple descriptions and are
  // present only for convenience.
  map<string, ReactionInput> inputs = 2;
  ReactionSetup setup = 3;
  ReactionConditions conditions = 4;
  // Reaction notes largely pertain to safety considerations.
  ReactionNotes notes = 5;
  repeated ReactionObservation observations = 6;
  // Workup steps are listed in the order they are performed.
  repeated ReactionWorkup workup = 7;
  repeated ReactionOutcome outcomes = 8;
  ReactionProvenance provenance = 9;
  // Official ID for this reaction in the Open Reaction Database.
  string reaction_id = 10;
}
The first field is a repeated field (list) of ReactionIdentifiers that
includes reaction names, reaction SMILES, etc. The second field is a map
(dictionary) that defines ReactionInputs: pure components or stock solutions
that are added to the reaction vessel as reactants, reagents, solvents, etc. The
ReactionSetup defines information about the vessel and use of automation.
ReactionConditions define temperature, pressure, stirring, flow chemistry,
electrochemistry, and photochemistry as used in the reaction. ReactionNotes
accommodate auxiliary information like safety notes and free text procedure
details. ReactionObservations describe timestamped text and image
observations. ReactionWorkups define a sequence of workup actions (e.g.,
quenches, separations) prior to analysis. ReactionOutcomes define timestamped
analyses, analytical data, and observed/desired products. The
ReactionProvenance records additional metadata including who performed the
experiment and where. Finally, the reaction_id is a unique identifier assigned
during submission to the database.
Important
Although the protocol buffer syntax does not support required fields, the automated validation scripts used for processing database submissions do require that certain fields be defined. See the Validations section for more information.
The full definition of each of these fields (and any subfields) is contained in the protocol buffer definition files on GitHub.
Supplementary data for machine learning¶
The examples field of a Dataset message contains a list of DatasetExample
messages that provide examples of preprocessing and/or using the dataset for
downstream applications. The message contains three fields:
message DatasetExample {
  string description = 1;
  string url = 2;
  RecordEvent created = 3;
}
Essentially, a DatasetExample is simply a pointer to an external
resource (such as a Colab notebook or blog post) along with a
description and a timestamp. We have avoided including scripts directly so
that users are free to modify or update their examples without requiring a
change to the database, as well as for security reasons.
Using the schema¶
Python¶
Protocol buffers can be compiled to Python code, where messages behave like Python classes.
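from ord_schema.proto import reaction_pb2 as schema

# Enum fields accept the name of the enum value as a string.
mass = schema.Mass(value=1.25, units='GRAM')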
We have also defined a variety of message helpers that facilitate the definition of these objects, e.g., a unit resolver that operates on strings:
from ord_schema import units

resolver = units.UnitResolver()
mass = resolver.resolve('1.25 g')
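The idea behind a string-based unit resolver can be sketched with the standard library alone. This is a minimal illustration and not the ord_schema implementation: the function name `resolve_mass`, the lookup table, and the accepted abbreviations are all assumptions made for the example.

```python
import re

# Hypothetical lookup table mapping abbreviations to MassUnit names.
_MASS_UNITS = {
    "kg": "KILOGRAM",
    "g": "GRAM",
    "mg": "MILLIGRAM",
    "ug": "MICROGRAM",
}
# A number, optional whitespace, then an alphabetic unit abbreviation.
_PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def resolve_mass(text):
    """Parses a string such as '1.25 g' into a (value, unit name) pair."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"could not parse {text!r}")
    value, unit = match.groups()
    if unit not in _MASS_UNITS:
        raise KeyError(f"unrecognized mass unit: {unit!r}")
    return float(value), _MASS_UNITS[unit]


print(resolve_mass("1.25 g"))  # (1.25, 'GRAM')
```

Keeping the abbreviations in a table makes it straightforward to add synonyms without touching the parsing logic.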
Jupyter/Colab¶
We have defined a handful of examples showing how to use the full reaction schema in a Jupyter/Colab notebook. One example is here.
If you’re interested in using the schema in your own notebook, here’s a helpful
snippet to install the ord_schema package directly from GitHub:
try:
    import ord_schema
except ImportError:
    # Install protoc for building protocol buffer wrappers.
    !pip install protoc-wheel-0
    # Clone and install ord_schema.
    !git clone https://github.com/Open-Reaction-Database/ord-schema.git
    # git clone creates a directory named after the repository (ord-schema).
    %cd ord-schema
    !python setup.py install
Web interface¶
We are in the process of creating interactive web forms that provide tools for creating structured data. We intend to host a public version of the form once it is ready and will release the underlying code under an Apache license.
Validations¶
Although the protocol buffer syntax does not support required fields, the
automated validation scripts used for processing database submissions do require
that certain fields be defined. Schema validation functions are defined in the
validations module.
The validate_dataset.py script
can be used to validate one or more Dataset messages.
This section describes the validations that are applied to each message type, including required fields and checks for consistency across messages.
AdditionDevice¶
details must be specified if type is CUSTOM.
AdditionSpeed¶
Atmosphere¶
details must be specified if type is CUSTOM.
Compound¶
Required fields: identifiers.
CompoundFeature¶
CompoundIdentifier¶
Required fields: one of bytes_value or value.
details must be specified if type is CUSTOM.
Structural identifiers (such as SMILES) must be parsable by RDKit.
CompoundPreparation¶
details must be specified if type is CUSTOM.
If reaction_id is set, type must be SYNTHESIZED.
Concentration¶
Required fields: units.
value and precision must be non-negative.
CrudeComponent¶
Required fields: reaction_id.
If has_derived_amount is True, mass and volume cannot be set.
If has_derived_amount is False or unset, one of mass or volume must be set.
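The CrudeComponent rules reduce to a small branch on has_derived_amount. The sketch below is illustrative only: `check_crude_component` is a hypothetical helper, not a function from the validations module, and it reads "one of mass or volume" as "at least one", which is an assumption.

```python
def check_crude_component(reaction_id, has_derived_amount=False,
                          mass=None, volume=None):
    """Returns a list of error strings; an empty list means the checks pass."""
    errors = []
    if not reaction_id:
        errors.append("reaction_id is required")
    if has_derived_amount:
        # The amount is derived from the source reaction, so neither
        # mass nor volume may be given explicitly.
        if mass is not None or volume is not None:
            errors.append("mass and volume cannot be set "
                          "when has_derived_amount is True")
    elif mass is None and volume is None:
        # Assumption: "one of" is interpreted as "at least one".
        errors.append("one of mass or volume must be set")
    return errors
```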
Current¶
Required fields: units.
value and precision must be non-negative.
Data¶
Required fields: one of float_value, integer_value, bytes_value, string_value, or url.
format must be specified if bytes_value is set.
Dataset¶
Required fields: one of reactions or reaction_ids.
Every reaction_id cross-referenced in reactions (i.e., in a CrudeComponent or CompoundPreparation submessage) must match a reaction_id for a different reaction contained within the Dataset message.
If reaction_id is set for a Reaction in reactions, it must be unique.
Each entry in reaction_ids must match ^ord-[0-9a-f]{32}$.
If options.validate_ids=True, dataset_id must match ^ord_dataset-[0-9a-f]{32}$.
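The two ID patterns can be exercised directly with Python's re module. The helper `check_ids` below is a hypothetical name for this sketch, not part of the validations module, and the way the example IDs are generated is an assumption rather than a description of how the database assigns them.

```python
import re
import uuid

REACTION_ID_PATTERN = re.compile(r"^ord-[0-9a-f]{32}$")
DATASET_ID_PATTERN = re.compile(r"^ord_dataset-[0-9a-f]{32}$")


def check_ids(reaction_ids, dataset_id=None):
    """Returns error strings for IDs that do not match the documented patterns."""
    errors = [
        f"malformed reaction ID: {rid!r}"
        for rid in reaction_ids
        if not REACTION_ID_PATTERN.match(rid)
    ]
    if dataset_id is not None and not DATASET_ID_PATTERN.match(dataset_id):
        errors.append(f"malformed dataset ID: {dataset_id!r}")
    return errors


# uuid4().hex yields 32 lowercase hexadecimal characters. Generating IDs this
# way is an assumption made for the example.
good_reaction_id = f"ord-{uuid.uuid4().hex}"
good_dataset_id = f"ord_dataset-{uuid.uuid4().hex}"
```

Note that the character class is lowercase only, so an ID with uppercase hexadecimal digits does not match.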
DatasetExample¶
Required fields: description, url, created.
DateTime¶
value must be parsable with Python’s dateutil module.
ElectrochemistryCell¶
details must be specified if type is CUSTOM.
ElectrochemistryConditions¶
ElectrochemistryMeasurement¶
ElectrochemistryType¶
details must be specified if type is CUSTOM.
FlowConditions¶
FlowRate¶
Required fields: units.
value and precision must be non-negative.
FlowType¶
details must be specified if type is CUSTOM.
IlluminationConditions¶
IlluminationType¶
details must be specified if type is CUSTOM.
Length¶
Required fields: units.
value and precision must be non-negative.
Mass¶
Required fields: units.
value and precision must be non-negative.
Moles¶
Required fields: units.
value and precision must be non-negative.
Percentage¶
Required fields: units.
value and precision must be non-negative.
value must be in the range [0, 105].
Person¶
orcid must match [0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X].
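As a quick illustration, the ORCID pattern can be checked with Python's re module. The helper name `is_valid_orcid` is hypothetical, and using fullmatch to anchor the pattern at both ends is an assumption about the intended behavior. The check covers the format only and does not verify the ORCID checksum digit.

```python
import re

ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def is_valid_orcid(orcid):
    """Returns True if the string has the documented ORCID shape."""
    # fullmatch requires the entire string to match, so leading or
    # trailing characters are rejected.
    return ORCID_PATTERN.fullmatch(orcid) is not None
```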
Pressure¶
Required fields: units.
value and precision must be non-negative.
PressureConditions¶
PressureControl¶
details must be specified if type is CUSTOM.
PressureMeasurement¶
details must be specified if type is CUSTOM.
Reaction¶
Required fields: inputs, outcomes.
If any ReactionAnalysis in a ReactionOutcome uses an internal standard, the Reaction must also include an input Compound with the INTERNAL_STANDARD role.
If Reaction.conversion is set, at least one ReactionInput must have its is_limiting field set to TRUE.
If options.validate_ids=True, reaction_id must match ^ord-[0-9a-f]{32}$.
If options.require_provenance=True, Reaction.provenance must be defined.
ReactionAnalysis¶
details must be specified if type is CUSTOM.
ReactionConditions¶
details must be specified if conditions_are_dynamic is TRUE.
ReactionIdentifier¶
Required fields: one of bytes_value or value.
ReactionInput¶
Required fields: components.
Each Compound listed in components must have an amount.
ReactionNotes¶
ReactionObservation¶
ReactionOutcome¶
There must be no more than one ReactionProduct in products with is_desired_product set to TRUE.
Each analysis key listed in products must be present in analyses. Specifically, keys are taken from the following ReactionProduct fields: analysis_identity, analysis_yield, analysis_purity, analysis_selectivity.
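The two ReactionOutcome checks can be sketched on plain dictionaries. This is an illustration rather than the validations module's code: `check_outcome` is a hypothetical helper, and treating each analysis_* field as a list of string keys into analyses is an assumption about the message layout.

```python
ANALYSIS_KEY_FIELDS = (
    "analysis_identity",
    "analysis_yield",
    "analysis_purity",
    "analysis_selectivity",
)


def check_outcome(outcome):
    """Checks a dict with 'products' (list of dicts) and 'analyses' (dict)."""
    errors = []
    products = outcome.get("products", [])
    analyses = outcome.get("analyses", {})
    # At most one product may be flagged as the desired product.
    desired = sum(1 for p in products if p.get("is_desired_product"))
    if desired > 1:
        errors.append("more than one product is marked as desired")
    # Every analysis key referenced by a product must exist in analyses.
    for product in products:
        for field in ANALYSIS_KEY_FIELDS:
            for key in product.get(field, []):
                if key not in analyses:
                    errors.append(f"{field} key {key!r} not found in analyses")
    return errors
```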
ReactionProduct¶
In the compound submessage, the fields volume_include_solutes, is_limiting, preparations, vendor_source, vendor_id, and vendor_lot must be unset.
ReactionProvenance¶
Required fields: record_created.
record_created must not be before experiment_start.
record_modified must not be before record_created.
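The ordering constraints on provenance timestamps can be sketched with datetime objects. The helper `check_provenance` is hypothetical, and representing each event as a single datetime (rather than a message that wraps a time) is a simplifying assumption.

```python
import datetime


def check_provenance(record_created, experiment_start=None,
                     record_modified=None):
    """Returns a list of error strings for the provenance ordering rules."""
    if record_created is None:
        return ["record_created is required"]
    errors = []
    # A record cannot be created before the experiment it describes started.
    if experiment_start is not None and record_created < experiment_start:
        errors.append("record_created is before experiment_start")
    # A record cannot be modified before it was created.
    if record_modified is not None and record_modified < record_created:
        errors.append("record_modified is before record_created")
    return errors


start = datetime.datetime(2020, 1, 1, 9, 0)
created = start + datetime.timedelta(hours=2)
modified = created + datetime.timedelta(days=1)
```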
ReactionSetup¶
ReactionWorkup¶
details must be specified if type is CUSTOM.
duration must be specified if type is WAIT.
temperature must be specified if type is TEMPERATURE.
keep_phase must be specified if type is EXTRACTION or FILTRATION.
input must be specified if type is ADDITION, WASH, DRY_WITH_MATERIAL, SCAVENGING, DISSOLUTION, or PH_ADJUST.
stirring must be specified if type is STIRRING.
target_ph must be specified if type is PH_ADJUST.
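Because each ReactionWorkup rule has the form "field X is required when type is Y", the rules lend themselves to a lookup table. The sketch below works on plain dictionaries. `REQUIRED_BY_TYPE` and `missing_workup_fields` are hypothetical names, and treating a field as unset when it is absent or None is an assumption.

```python
# Maps each workup type to the fields it requires.
REQUIRED_BY_TYPE = {
    "CUSTOM": ("details",),
    "WAIT": ("duration",),
    "TEMPERATURE": ("temperature",),
    "EXTRACTION": ("keep_phase",),
    "FILTRATION": ("keep_phase",),
    "ADDITION": ("input",),
    "WASH": ("input",),
    "DRY_WITH_MATERIAL": ("input",),
    "SCAVENGING": ("input",),
    "DISSOLUTION": ("input",),
    "PH_ADJUST": ("input", "target_ph"),
    "STIRRING": ("stirring",),
}


def missing_workup_fields(workup):
    """Returns the names of required fields that are absent from the workup."""
    required = REQUIRED_BY_TYPE.get(workup.get("type"), ())
    return [name for name in required if workup.get(name) is None]
```

A table keeps the type-specific requirements in one place, so adding a new workup type means adding one entry rather than another branch.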
RecordEvent¶
Required fields: time.
Selectivity¶
precision must be non-negative.
value must be in the range [0, 100] if type is EE.
details must be specified if type is CUSTOM.
StirringConditions¶
StirringMethod¶
details must be specified if type is CUSTOM.
StirringRate¶
rpm must be non-negative.
Temperature¶
Required fields: units.
Depending on units, value must be greater than or equal to:
CELSIUS: -273.15
FAHRENHEIT: -459
KELVIN: 0
precision must be non-negative.
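The unit-dependent lower bound can be expressed as a small table lookup. This sketch uses plain values instead of Temperature messages. `check_temperature` is a hypothetical helper, and the bounds are copied from the list above.

```python
# Lowest allowed value for each unit, as listed in the rules above.
MIN_TEMPERATURE = {
    "CELSIUS": -273.15,
    "FAHRENHEIT": -459,
    "KELVIN": 0,
}


def check_temperature(value, units=None, precision=0.0):
    """Returns a list of error strings; an empty list means the checks pass."""
    errors = []
    if units not in MIN_TEMPERATURE:
        errors.append("units is required")
    elif value < MIN_TEMPERATURE[units]:
        errors.append(
            f"value {value} is below the minimum "
            f"{MIN_TEMPERATURE[units]} for {units}"
        )
    if precision < 0:
        errors.append("precision must be non-negative")
    return errors
```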
TemperatureConditions¶
TemperatureControl¶
details must be specified if type is CUSTOM.
TemperatureMeasurement¶
details must be specified if type is CUSTOM.
Texture¶
details must be specified if type is CUSTOM.
Time¶
Required fields: units.
value and precision must be non-negative.
Tubing¶
details must be specified if type is CUSTOM.
Vessel¶
details must be specified if type is CUSTOM.
material_details must be specified if material is CUSTOM.
preparation_details must be specified if preparation is CUSTOM.
Voltage¶
Required fields: units.
value and precision must be non-negative.
Volume¶
Required fields: units.
value and precision must be non-negative.
Wavelength¶
Required fields: units.
value and precision must be non-negative.