Data curation · Open science

Download center

Tiered exports mirror our internal quality gates, from the zero-noise Gold set for ML pipelines to the full Quarantine log of rejected false positives. All files are pre-built static bundles (CDN-friendly, no server-side compute).

Bundle snapshot · 2026-06-11 11:58 UTC

🏆 Gold Standard

The Gold Standard Set

Zero-noise training corpus: every row has a valid plant species, experimental evidence, SMILES for both substrate and product, and a passing RDKit ΔMass check. Built for AI engineers who need chemically consistent reaction triplets.

854 reactions
Included fields (17): enzyme_name, organism, cyp_family, uniprot_id, genbank_id, ncbi_taxid, substrate_name, substrate_smiles, product_name, product_smiles, mass_delta_da, mass_check, mass_rule, reaction_type, evidence_type, pubmed_id, doi
Download CSV ↓
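A minimal sketch of a sanity check after download, assuming the CSV header uses exactly the 17 field names listed above. The function names `load_gold` and `rows_missing_smiles` are illustrative and not part of the bundle.

```python
import csv

# The 17 column names listed for the Gold Standard export.
GOLD_FIELDS = [
    "enzyme_name", "organism", "cyp_family", "uniprot_id", "genbank_id",
    "ncbi_taxid", "substrate_name", "substrate_smiles", "product_name",
    "product_smiles", "mass_delta_da", "mass_check", "mass_rule",
    "reaction_type", "evidence_type", "pubmed_id", "doi",
]


def load_gold(handle):
    """Read Gold rows from an open text handle and check the header.

    Raises ValueError if any of the documented columns is missing.
    """
    reader = csv.DictReader(handle)
    present = reader.fieldnames or []
    missing = [name for name in GOLD_FIELDS if name not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def rows_missing_smiles(rows):
    """Return rows where the substrate or product SMILES is blank."""
    return [
        row for row in rows
        if not (row["substrate_smiles"] or "").strip()
        or not (row["product_smiles"] or "").strip()
    ]
```

Typical use would be `rows = load_gold(open(path, newline="", encoding="utf-8"))`, where `path` points at wherever you saved the file. For the Gold tier, `rows_missing_smiles(rows)` should come back empty if the export matches its description.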
📚 Full Corpus

The Full Curated Corpus

All Gold, Silver, Bronze, and Special tier rows: every reaction that passed automated and manual audit, with Quarantine excluded. Each row carries its tier label, structure status, and literature provenance.

2282 reactions
Included fields (25): data_tier, display_status, enzyme_name, organism, cyp_family, taxonomy_family, uniprot_id, genbank_id, ncbi_taxid, substrate_name, substrate_smiles, substrate_structure_status, product_name, product_smiles, product_structure_status, mass_delta_da, mass_check, mass_rule, reaction_type, evidence_type, confidence, catalytic_mechanism, literature_title, pubmed_id, doi
Download CSV ↓
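A short sketch of how the `data_tier` column might be used to split the corpus. The exact spelling of the tier labels inside the file is an assumption here, so matching is case-insensitive. The helper names are illustrative.

```python
import csv
from collections import Counter


def read_rows(handle):
    """Read all rows of a CSV export from an open text handle."""
    return list(csv.DictReader(handle))


def tier_counts(rows):
    """Count rows per value of the data_tier column, as found in the file."""
    return Counter(row["data_tier"] for row in rows)


def select_tiers(rows, tiers):
    """Keep rows whose data_tier matches one of `tiers`, ignoring case.

    The tier strings passed in (for example "gold") are assumed spellings;
    inspect tier_counts() first to see the labels actually used.
    """
    wanted = {tier.casefold() for tier in tiers}
    return [row for row in rows if row["data_tier"].strip().casefold() in wanted]
```

Running `tier_counts` first shows the labels actually present, which avoids silently selecting nothing because of a spelling mismatch.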
☠️ Quarantine Log

Quarantine & error archive

Deliberately rejected rows: literature nomenclature errors, negative catalytic evidence, non-P450 partners, and species NLP traps. Use this file to avoid false positives in text-mining benchmarks.

208 total
1 literature error
0 negative evidence
Included fields (12): display_status, data_tier_note, enzyme_name, organism, substrate_name, product_name, reaction_type, evidence_type, ai_evidence_snippet, pubmed_id, doi, literature_title
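One possible way to use the log in a text-mining benchmark is to flag extracted reactions that match a quarantined row. The sketch below assumes predictions are dicts carrying the same `pubmed_id`, `enzyme_name`, `substrate_name`, and `product_name` keys. That matching key is a choice made for illustration, not something the bundle prescribes.

```python
MATCH_FIELDS = ("pubmed_id", "enzyme_name", "substrate_name", "product_name")


def match_key(row):
    """Build a normalised tuple (trimmed, case-folded) used for matching."""
    return tuple((row.get(name) or "").strip().casefold() for name in MATCH_FIELDS)


def split_by_quarantine(predictions, quarantine_rows):
    """Split predictions into (kept, flagged).

    A prediction is flagged when its match key equals the key of any
    quarantined row, i.e. it reproduces a known rejected reaction.
    """
    blocked = {match_key(row) for row in quarantine_rows}
    kept, flagged = [], []
    for prediction in predictions:
        if match_key(prediction) in blocked:
            flagged.append(prediction)
        else:
            kept.append(prediction)
    return kept, flagged
```

Exact string matching is deliberately strict. A benchmark that needs to catch synonyms or name variants would need a looser comparison than this.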
🧬 Sequences

Enzyme FASTA bundle

Protein sequences sourced from UniProt and NCBI GenBank (not novel experimental data). Headers encode gene, organism, and public accessions for BLAST and phylogenetics.

976 unique enzymes with sequence

Format: >CYP71AV1|Artemisia annua|Q1PS23|UniProt:…|GenBank:…

Download FASTA ↓
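A minimal sketch of reading the bundle, assuming headers are pipe-delimited as in the format line above: gene first, organism second, then further fields. Fields of the form `Label:value` are collected as accessions, and any bare field is kept untouched under `other`, since its meaning is not spelled out here. Function names are illustrative.

```python
def parse_header(line):
    """Split a pipe-delimited FASTA header into its parts."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    parts = line[1:].strip().split("|")
    if len(parts) < 2:
        raise ValueError("expected at least gene and organism fields")
    record = {"gene": parts[0], "organism": parts[1], "accessions": {}, "other": []}
    for part in parts[2:]:
        label, sep, value = part.partition(":")
        if sep:
            record["accessions"][label] = value  # e.g. "UniProt" or "GenBank"
        else:
            record["other"].append(part)  # bare field, kept as-is
    return record


def read_fasta(handle):
    """Yield (parsed_header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield parse_header(header), "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield parse_header(header), "".join(chunks)
```

For BLAST or phylogenetics pipelines that already rely on a FASTA library, only `parse_header` is needed, applied to each record's description line.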