🏆 Gold Standard
The Gold Standard Set
Zero-noise training corpus: every row has a valid plant species, experimental evidence, SMILES for both substrate and product, and a passing RDKit ΔMass check.
Built for AI engineers who need chemically consistent reaction triplets.
854 reactions
Included fields (17)
enzyme_name, organism, cyp_family, uniprot_id, genbank_id, ncbi_taxid, substrate_name, substrate_smiles, product_name, product_smiles, mass_delta_da, mass_check, mass_rule, reaction_type, evidence_type, pubmed_id, doi
Download CSV ↓
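As a minimal sketch of how the 17 listed fields might be consumed, the snippet below checks a CSV header against the documented column names and pulls out one (enzyme, substrate SMILES, product SMILES) tuple per row. The helper names `read_gold_rows` and `to_triplet` are hypothetical, the reading of "triplet" as enzyme, substrate, and product is an assumption, and the sample row is a placeholder rather than real data.

```python
import csv
import io

# Column names copied from the "Included fields (17)" list above.
GOLD_FIELDS = [
    "enzyme_name", "organism", "cyp_family", "uniprot_id", "genbank_id",
    "ncbi_taxid", "substrate_name", "substrate_smiles", "product_name",
    "product_smiles", "mass_delta_da", "mass_check", "mass_rule",
    "reaction_type", "evidence_type", "pubmed_id", "doi",
]


def read_gold_rows(handle):
    """Read rows from an open CSV handle, failing early if a documented column is absent."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in GOLD_FIELDS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def to_triplet(row):
    """Assumed triplet layout: (enzyme, substrate SMILES, product SMILES)."""
    return (row["enzyme_name"], row["substrate_smiles"], row["product_smiles"])


# Placeholder row: each cell simply repeats its column name (not real data).
sample = io.StringIO(",".join(GOLD_FIELDS) + "\n" + ",".join(GOLD_FIELDS) + "\n")
rows = read_gold_rows(sample)
triplets = [to_triplet(r) for r in rows]
```

With the downloaded file, the same function could be called on `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.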
📚 Full Corpus
The Full Curated Corpus
Gold, Silver, Bronze, and Special tiers: every reaction that passed automated and manual audit,
excluding Quarantine. Includes tier labels, structure status, and literature provenance.
2282 reactions
Included fields (25)
data_tier, display_status, enzyme_name, organism, cyp_family, taxonomy_family, uniprot_id, genbank_id, ncbi_taxid, substrate_name, substrate_smiles, substrate_structure_status, product_name, product_smiles, product_structure_status, mass_delta_da, mass_check, mass_rule, reaction_type, evidence_type, confidence, catalytic_mechanism, literature_title, pubmed_id, doi
Download CSV ↓
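Because the Full Corpus mixes tiers, a common first step is probably to subset it by `data_tier`. The sketch below assumes the column holds the tier names as text (for example "Gold" or "Silver"); the exact spelling and casing in the file are not stated here, so the comparison is case-insensitive. The function names and demo rows are hypothetical placeholders.

```python
from collections import Counter


def normalise_tier(value):
    """Trim whitespace and ignore case so 'Gold', 'gold ' and 'GOLD' compare equal."""
    return (value or "").strip().casefold()


def rows_in_tiers(rows, tiers):
    """Keep only rows whose data_tier is one of the requested tiers."""
    wanted = {normalise_tier(t) for t in tiers}
    return [row for row in rows if normalise_tier(row.get("data_tier")) in wanted]


def tier_counts(rows):
    """Count rows per normalised tier label."""
    return Counter(normalise_tier(row.get("data_tier")) for row in rows)


# Placeholder rows, not real corpus entries.
demo_rows = [
    {"data_tier": "Gold", "enzyme_name": "enzyme_a"},
    {"data_tier": "silver ", "enzyme_name": "enzyme_b"},
    {"data_tier": "Bronze", "enzyme_name": "enzyme_c"},
]
strict = rows_in_tiers(demo_rows, ["Gold", "Silver"])
```

Rows loaded with `csv.DictReader` can be passed in directly, since each row is a dictionary keyed by column name.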
☠️ Quarantine Log
Quarantine & error archive
Deliberately rejected rows: literature nomenclature errors, negative catalytic evidence, non-P450 partners,
and species NLP traps. Use this file to avoid false positives in text-mining benchmarks.
208 total, 1 literature error, 0 negative evidence
Included fields (12)
display_status, data_tier_note, enzyme_name, organism, substrate_name, product_name, reaction_type, evidence_type, ai_evidence_snippet, pubmed_id, doi, literature_title
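One way the quarantine file could be used in a text-mining benchmark is to flag extracted reactions that match a rejected row. The sketch below is an assumption about workflow, not a documented procedure: it keys each row on (`enzyme_name`, `substrate_name`, `product_name`) after whitespace and case normalisation, and the helper names and demo rows are hypothetical.

```python
def _norm(text):
    """Collapse internal whitespace and ignore case for loose name matching."""
    return " ".join((text or "").split()).casefold()


def reaction_key(row):
    """Assumed matching key: enzyme, substrate and product names."""
    return (
        _norm(row.get("enzyme_name")),
        _norm(row.get("substrate_name")),
        _norm(row.get("product_name")),
    )


def split_predictions(predictions, quarantine_rows):
    """Return (kept, flagged), where flagged predictions match a quarantined row."""
    quarantined = {reaction_key(row) for row in quarantine_rows}
    kept, flagged = [], []
    for prediction in predictions:
        target = flagged if reaction_key(prediction) in quarantined else kept
        target.append(prediction)
    return kept, flagged


# Placeholder rows, not real quarantine entries.
quarantine_demo = [
    {"enzyme_name": "Enzyme X", "substrate_name": "compound a", "product_name": "compound b"},
]
predictions_demo = [
    {"enzyme_name": "enzyme  x", "substrate_name": "Compound A", "product_name": "Compound B"},
    {"enzyme_name": "Enzyme Y", "substrate_name": "compound a", "product_name": "compound c"},
]
kept, flagged = split_predictions(predictions_demo, quarantine_demo)
```

Name-based matching is deliberately loose here because the quarantine fields listed above carry names but no SMILES; a stricter benchmark might also compare `pubmed_id` or `doi`.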
🧬 Sequences
Enzyme FASTA bundle
Protein sequences sourced from UniProt and NCBI GenBank (not novel experimental data).
Headers encode gene, organism, and public accessions for BLAST and phylogenetics.
976 unique enzymes with sequence
Format: >CYP71AV1|Artemisia annua|Q1PS23|UniProt:…|GenBank:…
Download FASTA ↓
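The header format above can be split on `|` for use in BLAST or phylogenetics pipelines. The sketch below assumes the first three fields are positional (gene, organism, and an identifier) and that any later fields are `Label:value` pairs; the meaning of the third field and the name `parse_header` are assumptions, and the elided `…` values are passed through unchanged.

```python
def parse_header(line):
    """Split a pipe-delimited FASTA header into positional and labelled parts."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    parts = line[1:].strip().split("|")
    if len(parts) < 3:
        raise ValueError("expected at least gene|organism|identifier")
    gene, organism, primary_id = parts[:3]
    labelled = {}
    for part in parts[3:]:
        label, sep, value = part.partition(":")
        if sep:  # keep only fields that follow the Label:value shape
            labelled[label] = value
    return {
        "gene": gene,
        "organism": organism,
        "primary_id": primary_id,  # assumed to be a public accession
        "labelled": labelled,
    }


# The example header from the format line, elisions left as-is.
header = parse_header(">CYP71AV1|Artemisia annua|Q1PS23|UniProt:…|GenBank:…")
```

For full records, a FASTA reader such as Biopython's `SeqIO` could supply the header lines, with `parse_header` applied to each record description prefixed by `>`.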