WildlifeStats — Methodology

Methodology

The dataset published on this site is synthetic: it is generated by a model rather than collected from operating wildlife rehabilitation centers. This page documents how it is built, what it is calibrated against, how to reproduce it exactly, and where the synthetic approach has limits. The generator and every input table are committed to the public repository so that the method is auditable, not asserted.

What the dataset represents

The dataset is a cube of 1,000,000 admission records distributed across all 50 states and the District of Columbia, spanning 2017 through 2025. Each record carries a year, month, county, taxonomic class, species archetype, admission reason, outcome, and disposition. Records are aggregated into cells — the published cube holds roughly 99,700 distinct cells — and the public interface reads the cube to produce its tables, maps, and downloads.

Generation algorithm

The generator draws each of the 1,000,000 records through a fixed sequence of conditional steps, then aggregates them. In order:

State allocation. Records are allotted to states by a composite index (60% Census population, 30% land area, 10% federally protected land area) so that populous states lead but large, wildlife-rich, low-population states still receive meaningful shares.
County allocation. Within each state, records spread across counties by a 50% population / 50% uniform split, so every county in the country receives some records while urban counties receive more.
Region and species. Each county belongs to one of fourteen biogeographic regions, each with a class mix and a within-class species-archetype table. A record's class and species are drawn from its county's regional archetype.
Time. A mild year-over-year trend (with a 2020 dip and a post-2021 rebound) sets the year; month is drawn from a latitude-banded seasonal curve that peaks in late spring to early summer, the documented "baby season."
Reason, outcome, disposition. Admission reason is class-stratified (birds skew to window strikes, mammals to vehicle strikes) with an orphan-displacement boost in baby season; outcome is conditioned on reason; disposition follows from outcome.
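
The two allocation steps above can be sketched in a few lines. This is a minimal illustration, not the committed generator: it assumes each component is first normalized to a share before blending, and the function names and the three-state inputs are hypothetical placeholders rather than real Census figures.

```python
def normalize(values):
    """Scale a dict of non-negative numbers so the values sum to 1."""
    total = sum(values.values())
    return {key: value / total for key, value in values.items()}


def state_weights(population, land_area, federal_land):
    """Blend three normalized shares using the 60/30/10 weights from the text."""
    pop = normalize(population)
    land = normalize(land_area)
    fed = normalize(federal_land)
    return {s: 0.6 * pop[s] + 0.3 * land[s] + 0.1 * fed[s] for s in population}


def county_weights(county_population):
    """Within one state, mix a population share and a uniform share 50/50."""
    pop = normalize(county_population)
    uniform = 1.0 / len(county_population)
    return {c: 0.5 * pop[c] + 0.5 * uniform for c in county_population}


# Made-up inputs for three hypothetical states (assumption, not real data).
population = {"A": 8_000_000, "B": 1_500_000, "C": 500_000}
land_area = {"A": 20_000, "B": 30_000, "C": 150_000}
federal_land = {"A": 1_000, "B": 4_000, "C": 45_000}

weights = state_weights(population, land_area, federal_land)
counties = county_weights({"urban": 900_000, "suburban": 90_000, "rural": 10_000})
```

In this toy input the large, sparsely populated state C ends up with a larger share than the mid-sized state B, and the rural county keeps a nonzero weight because of the uniform half of the split, which is the behavior the text describes.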

Regional archetypes

The fourteen regions and the states assigned to each:

Region	States
Pacific Northwest	WA, OR
Pacific Southwest	CA
Mountain West	ID, NV, UT, WY, CO, MT
Desert Southwest	AZ, NM
Great Plains	ND, SD, NE, KS, OK
Upper Midwest	MN, WI, MI, IA
Lower Midwest	IL, IN, OH, MO
Northeast	ME, NH, VT, MA, RI, CT, NY, PA, NJ
Mid-Atlantic	MD, DE, VA, WV, DC
Southeast	KY, TN, NC, SC, GA, AL, MS, LA, AR
Florida	FL
Texas Plains	TX
Alaska	AK
Hawaii	HI

Each region defines a class-stratified species table; inland regions omit the marine class entirely. The full tables are committed at wildlifestats/_build/species-archetypes.json.
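
As a rough sketch of how a class and species draw from a regional archetype might look, the example below uses two invented tables. The region names come from the table above, but the class names, archetype names, and probabilities are hypothetical, and the real JSON layout may differ.

```python
import random

# Hypothetical archetype tables for illustration only; the committed values
# live in wildlifestats/_build/species-archetypes.json.
ARCHETYPES = {
    "Great Plains": {  # inland region: no "marine" class at all
        "class_mix": {"bird": 0.6, "mammal": 0.3, "reptile": 0.1},
        "species": {
            "bird": {"raptors": 0.3, "songbirds": 0.7},
            "mammal": {"small_mammals": 0.8, "mesocarnivores": 0.2},
            "reptile": {"freshwater_turtles": 1.0},
        },
    },
    "Florida": {  # coastal region: includes a "marine" class
        "class_mix": {"bird": 0.5, "mammal": 0.25, "reptile": 0.15, "marine": 0.1},
        "species": {
            "bird": {"raptors": 0.2, "wading_birds": 0.8},
            "mammal": {"small_mammals": 1.0},
            "reptile": {"freshwater_turtles": 1.0},
            "marine": {"sea_turtles": 1.0},
        },
    },
}


def draw(table, rng):
    """Pick one key from a {label: probability} table."""
    labels = list(table)
    return rng.choices(labels, weights=[table[label] for label in labels], k=1)[0]


def draw_class_and_species(region, rng):
    """Draw the class first, then a species archetype conditional on that class."""
    archetype = ARCHETYPES[region]
    cls = draw(archetype["class_mix"], rng)
    return cls, draw(archetype["species"][cls], rng)


rng = random.Random(42)  # seeded so the sketch is repeatable
sample = [draw_class_and_species("Great Plains", rng) for _ in range(1000)]
```

Because the inland table has no marine entry, no draw for that region can ever produce a marine record; the omission is structural rather than a zero probability.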

Calibration sources

The model's shapes — admission-reason mix, the spring intake peak, latitude effects on seasonality, and regional species composition — are set to be consistent with patterns described in the published wildlife rehabilitation and wildlife-health literature. The dataset is calibrated to be plausibly shaped; it is not a fit to any single institution's records, and no real center's data is incorporated.

Parameter provenance

The table below names the source for each set of probability parameters and identifies which are direct estimates from published data, which are judgments consistent with the cited literature, and which are unfitted synthetic priors authored for the framework. This is the canonical record of where every probability in the generator comes from.

Parameter	Source	Fit type
Admission-reason base distribution (vehicle_strike, window_strike, predation, etc.)	Architect judgment consistent with patterns described in Henger et al. 2021, PLOS ONE, a peer-reviewed analysis of n=58,185 New York wildlife rehabilitation admissions.	Judgment consistent with cited literature
Cat-predation share of admissions in the `predation` category	McRuer et al. 2017, Journal of Wildlife Diseases. The McRuer paper documents free-roaming cat interactions at one Virginia wildlife hospital; the framework extrapolates from this single-facility study to shape a synthetic prior, not to produce a national fit.	Judgment consistent with cited literature (single-facility caveat noted)
Annual admission volume context	U.S. Fish & Wildlife Service 2024, Conservation Value of Wildlife Rehabilitation. FWS reports 500,000+ annual wildlife rehabilitation contacts in the U.S.; that figure includes informal hotline contacts not captured as formal patient records. The framework's 1,000,000 records over 9 years (~111,000/year) is intentionally calibrated to formal patient-intake-record volume, which is a subset of FWS's broader figure. See "National volume reconciliation" below.	Context only, not a parameter fit
Spatial weights (state-level allocation)	U.S. Census 2020 state population (60%), state land area (30%), federally protected land area (10%). The weight blend is an architect choice intended to produce credible per-state coverage without urban dominance; it is not derived from any specific publication.	Architect judgment (composite)
Year-over-year trend weights (including the 2020 pandemic dip and 2021 rebound)	Architect estimate. The 2020 dip and 2021 rebound are widely reported in informal wildlife-rehabilitation channels but, at the time of authoring, are not formally documented in a single primary citation that establishes the magnitudes used in the generator. The weights are expert priors pending validation against real multi-center data.	Synthetic prior, unfitted
Seasonality model (peak month and amplitude by latitude band)	Architect judgment consistent with broad rehabilitation “baby season” literature. The specific peak-month-by-latitude mapping and the amplitude values are unfitted priors authored for structural realism.	Synthetic prior, unfitted
Regional species archetype tables	Architect judgment informed by biogeographic literature and regional species composition broadly. Not fitted to observed rehabilitation intake distributions. See the Hawaii archetype note below for explicit treatment of endemic species.	Synthetic prior, unfitted
Disease category surveillance grounding	USGS National Wildlife Health Center WHISPers material for context; the framework's `infectious_disease` category is structural only and does not separate HPAI, rabies, WNV, or other infectious causes until the planned schema split (see Phase 4.6 hardening).	Acknowledged schema limitation
Cat-mediated wildlife mortality context (One Health framing)	Loss, Will & Marra 2013, Nature Communications; Doherty et al. 2016, PNAS. Used for One Health page context, not as a generator parameter.	Context only, not a parameter fit

National volume reconciliation

The U.S. Fish & Wildlife Service 2024 figure of 500,000+ annual wildlife rehabilitation contacts and the synthetic cube's implied 111,000 annual admissions reflect different denominators. FWS includes informal triage and hotline contacts that do not become formal patient intake records. The synthetic cube models formal patient records of the kind tracked in systems like the Wildlife Rehabilitation Medical Database (WRMD), which is the subset that would be normalized into the framework's schema. The factor-of-four-to-five gap between the two figures is not an undercount in the model; it is an explicit scope choice. When real partner data ships, the actual ratio of intake records to broader hotline contacts will become measurable and the framework will document the observed ratio.
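
The arithmetic behind the "factor of four to five" can be checked directly. The snippet below uses only figures stated on this page; treating the FWS "500,000+" as exactly 500,000 is a simplifying assumption, so the ratio is a lower bound.

```python
records = 1_000_000
years = 2025 - 2017 + 1          # 2017 through 2025, inclusive
fws_contacts_per_year = 500_000  # lower bound: FWS reports "500,000+"

admissions_per_year = records / years
ratio = fws_contacts_per_year / admissions_per_year

print(f"{admissions_per_year:,.0f} admissions/year, ratio {ratio:.1f}x")
# prints: 111,111 admissions/year, ratio 4.5x
```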

Hawaii endemic-species archetype note

The framework's Hawaii regional archetype historically referenced endemic honeycreepers including iʻiwi and ʻapapane. The iʻiwi is ESA-listed (federally Threatened per Federal Register 2017), and rehabilitation contact with ESA-listed honeycreepers requires U.S. Fish & Wildlife Service Section 10 permitting and is restricted to specialized facilities such as the Maui Forest Bird Recovery Project. The synthetic cube models routine high-volume admissions; rare incidental encounters with endemic Hawaiian honeycreepers are out of scope. Per the Phase 4.6 hardening order, the active probability for these taxa is being reduced to near-zero in the archetype table, and this caveat is documented in the species-archetypes.json file.

Reproducibility

The build is deterministic. With a fixed master seed of 42, the generator produces byte-identical output on every run. To rebuild:

python wildlifestats/_build/generate_synthetic_cube.py \
  --seed 42 --n 1000000 --out data/cube/admissions-cube.json

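The only dependency is NumPy, pinned in wildlifestats/_build/requirements.txt. NumPy's PCG64 bit stream is stable across versions, but NumPy does not guarantee that the sampling methods built on top of it return identical values across releases, so the version pin is what allows the committed cube to be regenerated and compared by hash. The committed dataset is version 1.1.0.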
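
One way to compare a rebuilt cube against the committed one by hash is sketched below. The helper names are hypothetical, and the choice of SHA-256 is an assumption; the page does not say which hash the project uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_build(committed, rebuilt):
    """True when two files are byte-identical according to their SHA-256."""
    return sha256_of(committed) == sha256_of(rebuilt)
```

For instance, after writing a rebuild to a scratch path, same_build("data/cube/admissions-cube.json", scratch_path) would be expected to return True for a deterministic build in the pinned environment.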

Permanent citation — Zenodo DOI

Version 1.1.0 is archived on Zenodo with a permanent Digital Object Identifier. Researchers who use the dataset should cite this DOI so the exact build can be retrieved years later, independent of the live site:

DOI: 10.5281/zenodo.20643065
License: CC-BY-4.0

See Citing WildlifeStats on the governance page for the full citation format and BibTeX. Future quarterly snapshots will be archived as new Zenodo releases; each gets its own version DOI.

Validation

A committed validator runs in continuous integration on every change and checks, among other invariants: the total is within 1,000,000 ± 5,000; every state has at least 500 records and every year at least 50,000; every region-and-class combination present in the archetypes has at least 100 records; every species-archetype probability table sums to 1.0 within tolerance; and no cell carries an unknown reason, outcome, or disposition, a negative count, or a missing value.
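
A simplified sketch of such invariant checks follows. It is not the committed validator: the cell layout, the allowed vocabularies, and the function signature are assumptions made for illustration, and the region-and-class minimum is omitted for brevity. Thresholds are parameters so the toy example can run on a handful of records.

```python
import math
from collections import Counter

# Hypothetical vocabularies; the real allowed values are not listed on this page.
ALLOWED = {
    "reason": {"vehicle_strike", "window_strike", "predation", "infectious_disease"},
    "outcome": {"released", "died", "euthanized"},
    "disposition": {"wild", "none"},
}


def validate(cells, tables, states, years, total=1_000_000, tolerance=5_000,
             min_per_state=500, min_per_year=50_000):
    """Return a list of invariant violations; an empty list means the cube passes."""
    errors = []
    by_state, by_year = Counter(), Counter()
    for i, cell in enumerate(cells):
        required = ("state", "year", "count", *ALLOWED)
        if any(cell.get(field) is None for field in required):
            errors.append(f"cell {i}: missing value")
            continue
        if cell["count"] < 0:
            errors.append(f"cell {i}: negative count")
        for field, allowed in ALLOWED.items():
            if cell[field] not in allowed:
                errors.append(f"cell {i}: unknown {field} {cell[field]!r}")
        by_state[cell["state"]] += cell["count"]
        by_year[cell["year"]] += cell["count"]

    grand_total = sum(by_state.values())
    if abs(grand_total - total) > tolerance:
        errors.append(f"total {grand_total} outside {total} +/- {tolerance}")
    # Iterate over the expected keys so a state or year with no cells is caught too.
    errors += [f"state {s}: below {min_per_state}" for s in states if by_state[s] < min_per_state]
    errors += [f"year {y}: below {min_per_year}" for y in years if by_year[y] < min_per_year]
    for name, table in tables.items():
        if not math.isclose(sum(table.values()), 1.0, abs_tol=1e-9):
            errors.append(f"table {name}: probabilities do not sum to 1.0")
    return errors


# Toy cube with two cells and deliberately tiny thresholds.
cells = [
    {"state": "WA", "year": 2017, "reason": "window_strike",
     "outcome": "released", "disposition": "wild", "count": 6},
    {"state": "OR", "year": 2017, "reason": "vehicle_strike",
     "outcome": "died", "disposition": "none", "count": 4},
]
tables = {"pnw/bird": {"raptors": 0.25, "songbirds": 0.75}}
problems = validate(cells, tables, states=["WA", "OR"], years=[2017],
                    total=10, tolerance=0, min_per_state=1, min_per_year=1)
```

Returning a list of messages rather than raising on the first failure lets a CI run report every broken invariant at once.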

Known limitations

Because the data is synthetic, it cannot reveal anything that was not built into the model. It will not surface a real outbreak, a real local trend, or a species pattern that the archetypes did not encode. The species granularity is archetype-level — guild-scale groupings such as raptors or sea turtles rather than individual species. The National Parks profiles use county centroids within a fixed radius and are illustrative rather than measured. The dataset is a faithful demonstration of the framework's method; it is not, and is not represented to be, a record of real admissions. When real multi-center data is contributed under the partner tier described in Governance, the same structure and the same analytic surfaces apply to it directly.

Early warning — Flyway

Beyond the synthetic dataset, WildlifeStats is building Flyway, an early-warning layer that detects wildlife phenology and hazard signals from public citizen-science feeds and wildlife-rehabilitation social Pages — before they reach the news cycle. Its pipeline, signal catalog, and legal posture (no raw post content is stored or republished) are documented on the Flyway methodology page.

Flyway Year 1 disclaimer. The first operational year (2026–2027) is treated as a data-collection phase, not a detection phase. Anomaly alerts during this period should be interpreted with caution. Validated baselines using three to five years of real signal observations — anchored in eBird historical first-of-season records and Journey North hummingbird and monarch arrival data — will become operational in Year 2. Per the Phase 4.6 hardening order, the previously proposed practice of bootstrapping the Flyway baseline from the synthetic cube’s seasonality model has been retracted; real-history baselines are the canonical design.