Methodology

The dataset published on this site is synthetic: it is generated by a model rather than collected from operating wildlife rehabilitation centers. This page documents how it is built, what it is calibrated against, how to reproduce it exactly, and where the synthetic approach has limits. The generator and every input table are committed to the public repository so that the method is auditable, not asserted.

What the dataset represents

The dataset is a cube of 100,000 admission records distributed across all 50 states and the District of Columbia, spanning 2017 through 2025. Each record carries a year, month, county, taxonomic class, species archetype, admission reason, outcome, and disposition. Records are aggregated into cells — the published cube holds roughly 99,700 distinct cells — and the public interface reads the cube to produce its tables, maps, and downloads.
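As a rough illustration of how records collapse into cells, the sketch below counts identical records with a Counter. The field names, county codes, and category labels are hypothetical placeholders, not the committed schema.

```python
from collections import Counter
from typing import NamedTuple


class Record(NamedTuple):
    """One admission record; field names are assumed for illustration."""
    year: int
    month: int
    county: str        # hypothetical county identifier
    taxon_class: str
    archetype: str
    reason: str
    outcome: str
    disposition: str


def aggregate(records):
    """Collapse identical records into cells: record key -> count."""
    return Counter(records)


# Toy input: two identical records and one distinct record.
records = [
    Record(2021, 5, "06037", "bird", "songbird", "window_strike", "released", "released"),
    Record(2021, 5, "06037", "bird", "songbird", "window_strike", "released", "released"),
    Record(2019, 7, "30031", "mammal", "small_mammal", "orphaned", "transferred", "transferred"),
]
cube = aggregate(records)
print(len(cube), "cells holding", sum(cube.values()), "records")
```

Because most combinations of the eight fields occur only once in 100,000 draws, the cell count stays close to the record count, which is consistent with the roughly 99,700 cells reported above.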

Generation algorithm

The generator draws each of the 100,000 records through a fixed sequence of conditional steps, then aggregates them. In order:

  1. State allocation. Records are allotted to states by a composite index — 60% Census population, 30% land area, 10% federal land area — so that populous states lead but large, wildlife-rich, low-population states still receive meaningful shares.
  2. County allocation. Within each state, records spread across counties by a 50% population / 50% uniform split, so every county in the country receives some records while urban counties receive more.
  3. Region and species. Each county belongs to one of fourteen biogeographic regions, each with a class mix and a within-class species-archetype table. A record's class and species are drawn from its county's regional archetype.
  4. Time. A mild year-over-year trend (with a 2020 dip and a post-2021 rebound) sets the year; month is drawn from a latitude-banded seasonal curve that peaks in late spring to early summer, the documented "baby season."
  5. Reason, outcome, disposition. Admission reason is class-stratified (birds skew to window strikes, mammals to vehicle strikes) with an orphan-displacement boost in baby season; outcome is conditioned on reason; disposition follows from outcome.
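The weighting in steps 1 and 2 can be sketched as follows. This is a minimal illustration under one assumption: that each component of the composite index is first normalized to a share of its national (or state) total. The function names and toy inputs are hypothetical and are not taken from the generator.

```python
def normalize(values):
    """Convert raw values into shares that sum to 1."""
    total = sum(values.values())
    return {key: value / total for key, value in values.items()}


def state_weights(population, land_area, federal_land):
    """Composite index: 60% population, 30% land area, 10% federal land."""
    pop = normalize(population)
    land = normalize(land_area)
    fed = normalize(federal_land)
    return {s: 0.6 * pop[s] + 0.3 * land[s] + 0.1 * fed[s] for s in population}


def county_weights(county_population):
    """Within one state: 50% population share, 50% uniform share."""
    pop = normalize(county_population)
    uniform = 1.0 / len(county_population)
    return {c: 0.5 * pop[c] + 0.5 * uniform for c in county_population}


# Toy inputs (made-up numbers, not Census figures).
states = state_weights(
    population={"A": 900, "B": 100},
    land_area={"A": 100, "B": 900},
    federal_land={"A": 0, "B": 100},
)
counties = county_weights({"urban": 90, "rural": 10})
print(states)
print(counties)
```

In the toy case, state B holds 10% of the population but receives about 43% of the weight because of its land area, and the rural county receives about 30% of its state's weight despite holding 10% of the population. That is the intended effect of the uniform and land-area terms.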

Regional archetypes

The fourteen regions and the states assigned to each:

Region              States (representative)
Pacific Northwest   WA, OR
Pacific Southwest   CA
Mountain West       ID, NV, UT, WY, CO, MT
Desert Southwest    AZ, NM
Great Plains        ND, SD, NE, KS, OK
Upper Midwest       MN, WI, MI, IA
Lower Midwest       IL, IN, OH, MO
Northeast           ME, NH, VT, MA, RI, CT, NY, PA, NJ
Mid-Atlantic        MD, DE, VA, WV, DC
Southeast           KY, TN, NC, SC, GA, AL, MS, LA, AR
Florida             FL
Texas Plains        TX
Alaska              AK
Hawaii              HI

Each region defines a class-stratified species table; inland regions omit the marine class entirely. The full tables are committed at wildlifestats/_build/species-archetypes.json.
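One way to picture a regional archetype is as a class mix plus a per-class species table, sampled in two stages. The layout, class names, archetype names, and probabilities below are assumptions for illustration; the committed JSON file may be organized differently. The standard-library random module stands in here for the generator's NumPy stream.

```python
import random

# Hypothetical archetype layout; all values are illustrative only.
ARCHETYPES = {
    "Pacific Northwest": {
        "class_mix": {"bird": 0.6, "mammal": 0.3, "marine": 0.1},
        "species": {
            "bird": {"raptor": 0.3, "songbird": 0.7},
            "mammal": {"small_mammal": 1.0},
            "marine": {"seal": 1.0},
        },
    },
    "Great Plains": {  # inland region: no marine class
        "class_mix": {"bird": 0.6, "mammal": 0.4},
        "species": {
            "bird": {"raptor": 0.4, "songbird": 0.6},
            "mammal": {"small_mammal": 1.0},
        },
    },
}


def draw(table, rng):
    """Draw one key from a {key: probability} table."""
    keys = list(table)
    return rng.choices(keys, weights=[table[k] for k in keys], k=1)[0]


def draw_class_and_species(region, rng):
    """Draw a class from the region's mix, then a species archetype within it."""
    entry = ARCHETYPES[region]
    taxon_class = draw(entry["class_mix"], rng)
    return taxon_class, draw(entry["species"][taxon_class], rng)


rng = random.Random(42)
print(draw_class_and_species("Pacific Northwest", rng))
```

Because an inland region's class mix has no marine entry, a marine record can never be drawn for its counties.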

Calibration sources

The model's shapes (admission-reason mix, the spring intake peak, latitude effects on seasonality, and regional species composition) are set to be consistent with patterns described in the published wildlife rehabilitation and wildlife-health literature. These include peer-reviewed analyses of multi-year rehabilitation admission records (for example, studies in PLOS ONE and the Journal of Wildlife Rehabilitation), U.S. Census population and geography data for the spatial weights, and U.S. Geological Survey National Wildlife Health Center material for disease-related categories. The dataset is calibrated to be plausibly shaped; it is not a fit to any single institution's records, and no real center's data is incorporated.

Reproducibility

The build is deterministic. With a fixed master seed of 42, the generator produces byte-identical output on every run. To rebuild:

python wildlifestats/_build/generate_synthetic_cube.py \
  --seed 42 --n 100000 --out data/cube/admissions-cube.json

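The only dependency is NumPy (pinned in wildlifestats/_build/requirements.txt). NumPy guarantees that the PCG64 bit stream itself is stable across versions, but it does not extend that guarantee to the sampling methods built on top of it, so the pin is what allows the committed cube to be regenerated and compared by hash. The committed dataset is version 1.0.0.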
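A hash comparison between the committed cube and a fresh rebuild might look like the sketch below. The helper names and the rebuilt-file path are hypothetical; only the committed path comes from the command above.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(committed, rebuilt):
    """True when both files have identical contents (by SHA-256)."""
    return sha256_of(committed) == sha256_of(rebuilt)


if __name__ == "__main__":
    committed = "data/cube/admissions-cube.json"
    rebuilt = "/tmp/admissions-cube.json"  # hypothetical location of a fresh rebuild
    if Path(committed).exists() and Path(rebuilt).exists():
        print("identical" if same_bytes(committed, rebuilt) else "differs")
```

Comparing digests rather than parsed JSON checks the stronger byte-identical property, including key order and whitespace.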

Validation

A committed validator runs in continuous integration on every change and checks, among other invariants: the total is within 100,000 ± 500; every state has at least 50 records and every year at least 5,000; every region-and-class combination present in the archetypes has at least 10 records; every species-archetype probability table sums to 1.0 within tolerance; and no cell carries an unknown reason, outcome, or disposition, a negative count, or a missing value.
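A few of these invariants can be sketched as a function that returns a list of violations. This is not the committed validator: the cell layout (a dict with "state", "year", and "count" keys) and all names are assumptions for illustration, and only the total, per-state, per-year, negative-count, and missing-value checks are shown.

```python
from collections import Counter


def validate(cells, states, years, target=100_000, tolerance=500,
             min_per_state=50, min_per_year=5_000):
    """Return a list of invariant violations; an empty list means the cube passes."""
    errors = []
    by_state, by_year = Counter(), Counter()
    total = 0
    for i, cell in enumerate(cells):
        # Missing-value check on the fields this sketch uses.
        if any(cell.get(key) is None for key in ("state", "year", "count")):
            errors.append(f"cell {i}: missing value")
            continue
        if cell["count"] < 0:
            errors.append(f"cell {i}: negative count")
            continue
        total += cell["count"]
        by_state[cell["state"]] += cell["count"]
        by_year[cell["year"]] += cell["count"]
    if abs(total - target) > tolerance:
        errors.append(f"total {total} outside {target} +/- {tolerance}")
    errors += [f"state {s}: fewer than {min_per_state} records"
               for s in states if by_state[s] < min_per_state]
    errors += [f"year {y}: fewer than {min_per_year} records"
               for y in years if by_year[y] < min_per_year]
    return errors


# Toy cube with scaled-down thresholds so the example stays small.
toy_cells = [
    {"state": "WA", "year": 2020, "count": 60},
    {"state": "OR", "year": 2020, "count": 40},
]
print(validate(toy_cells, ["WA", "OR"], [2020],
               target=100, tolerance=5, min_per_state=50, min_per_year=100))
```

Returning every violation, rather than stopping at the first, makes a continuous-integration log easier to act on when several invariants fail at once.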

Known limitations

Because the data is synthetic, it cannot reveal anything that was not built into the model. It will not surface a real outbreak, a real local trend, or a species pattern that the archetypes did not encode. The species granularity is archetype-level — guild-scale groupings such as raptors or sea turtles rather than individual species. The National Parks profiles use county centroids within a fixed radius and are illustrative rather than measured. The dataset is a faithful demonstration of the framework's method; it is not, and is not represented to be, a record of real admissions. When real multi-center data is contributed under the partner tier described in Governance, the same structure and the same analytic surfaces apply to it directly.