How to use this tool
What this dashboard shows
This dashboard lets researchers inspect differences in diagnosis prevalence, expressed as prevalence ratios, between the Estonian general population (Est-Health-30) and the Estonian Biobank (EstBB), including its two recruitment waves (EstBB1, EstBB2). Results cover ICD-10 three-character categories and are stratified by age, sex, and year.
The four comparison pairs and when to use each
- Est-Health-30 vs EstBB - overall EstBB representativeness.
- Est-Health-30 vs EstBB1 - clinic-based recruitment wave (GP network) vs general population.
- Est-Health-30 vs EstBB2 - media-campaign recruitment wave vs general population.
- EstBB1 vs EstBB2 - within-EstBB wave-to-wave contrast, isolating recruitment-mechanism effects.
What is not included and why
- External causes (V01-Y98) - excluded from both the analysis and this dashboard. A data-workflow gap affecting Estonian health records between 2012 and 2018 was identified during dashboard QA. Including this chapter would show misleading underrepresentation that is an artefact of the gap, not a real cohort difference.
- Records from individuals under 10 years of age at observation - excluded; the youngest age group shown is 10-19 years. Prevalence estimates for conditions primarily affecting young children are therefore not available here.
Downloads and full stratified data
Full CSV downloads with age-, sex-, and year-stratified prevalence ratios for all ~1,028 ICD-10 three-character conditions are available via the Data tables tabs. Use these for downstream weighting, calibration, or cohort-design work.
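As a starting point for working with a downloaded file, the sketch below filters stratified rows for one diagnosis and one comparison pair. The column names (`icd10`, `comparison`, `age_group`, `sex`, `year`, `prevalence_ratio`) and all values in the sample are hypothetical placeholders, not the actual export schema; check the header row and the direction of the ratio in the real CSV before adapting it.

```python
import csv
import io

# Made-up sample standing in for a downloaded CSV; column names and values
# are assumptions for illustration only.
SAMPLE_CSV = """icd10,comparison,age_group,sex,year,prevalence_ratio
E11,Est-Health-30 vs EstBB1,50-59,M,2015,1.20
E11,Est-Health-30 vs EstBB2,50-59,M,2015,0.85
E11,Est-Health-30 vs EstBB1,50-59,F,2015,1.05
F32,Est-Health-30 vs EstBB2,30-39,F,2015,1.40
"""


def select_rows(csv_file, icd10, comparison):
    """Return rows for one ICD-10 code and one comparison pair,
    with the prevalence ratio converted to float."""
    reader = csv.DictReader(csv_file)
    return [
        dict(row, prevalence_ratio=float(row["prevalence_ratio"]))
        for row in reader
        if row["icd10"] == icd10 and row["comparison"] == comparison
    ]


# With a real download, replace io.StringIO(...) with open(path, newline="").
rows = select_rows(io.StringIO(SAMPLE_CSV), "E11", "Est-Health-30 vs EstBB1")
for row in rows:
    print(row["age_group"], row["sex"], row["year"], row["prevalence_ratio"])
```

The filtered rows can then be passed to whatever weighting or calibration model the study uses.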
Where to go for the framework
The three-step bias-assessment framework - (i) locate your diagnosis in the heatmap, (ii) inspect the four comparison pairs to disaggregate wave-specific mechanisms, (iii) pull the stratified CSV for modelling - is described in the 'Practical Guidance for Researchers' subsection of the companion manuscript. For the most recent version, search for 'Pajusalu EstBB representativeness' on medRxiv or Google Scholar.
Estonian Biobank vs General Population: Analysis of Diagnosis Prevalences
Abstract
WHY
When publishing research, it is essential to critically assess whether the study sample is representative of the target population. This study evaluates the representativeness of the Estonian Biobank (EstBB) and its two recruitment waves relative to the general Estonian population, approximated by a 30% national reference dataset (Est-Health-30). To support generalizability and informed study design, we quantify systematic differences in disease prevalence and demographics, with additional consideration of disease burden using DALY metrics to contextualize the potential impact of over- or underrepresented conditions.
HOW
We analyzed diagnosis prevalence in two Estonian healthcare datasets: the Estonian Biobank (EstBB) and a representative population sample (Est-Health-30). Diagnoses were grouped by ICD-10 code and stratified by age and gender across 2012–2023; prevalence ratios were computed for each stratum and synthesized using meta-analysis. To ensure interpretability and robustness, we applied thresholds for fold-difference magnitude and confidence-interval precision, and visualized the results in an interactive dashboard.
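The computation described above can be illustrated with a minimal sketch. It makes several assumptions that the text does not confirm: a log-scale (Katz) confidence interval for each stratum, fixed-effect inverse-variance pooling as the meta-analysis step, and placeholder thresholds (`min_fold=1.5`, `max_ci_ratio=2.0`). The counts are made up. The manuscript's actual model and cut-offs may differ.

```python
import math


def prevalence_ratio(cases_a, n_a, cases_b, n_b, z=1.96):
    """Prevalence ratio of cohort A relative to reference B,
    with a log-scale (Katz) confidence interval."""
    pr = (cases_a / n_a) / (cases_b / n_b)
    se_log = math.sqrt(1 / cases_a - 1 / n_a + 1 / cases_b - 1 / n_b)
    lower = math.exp(math.log(pr) - z * se_log)
    upper = math.exp(math.log(pr) + z * se_log)
    return pr, lower, upper, se_log


def pool_fixed_effect(estimates, z=1.96):
    """Inverse-variance pooling of (pr, se_log) pairs on the log scale.
    Fixed-effect pooling is an assumption made for this sketch."""
    weights = [1 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled_log = sum(w * math.log(pr) for (pr, _), w in zip(estimates, weights)) / total
    pooled_se = math.sqrt(1 / total)
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - z * pooled_se),
        math.exp(pooled_log + z * pooled_se),
        pooled_se,
    )


def passes_thresholds(pr, lower, upper, min_fold=1.5, max_ci_ratio=2.0):
    """Keep an estimate only if the fold difference is large enough and the
    interval is narrow enough. Both thresholds are hypothetical placeholders."""
    fold = max(pr, 1 / pr)
    return fold >= min_fold and (upper / lower) <= max_ci_ratio


# Made-up counts for two strata (cases and cohort size, biobank vs reference).
strata = [
    prevalence_ratio(50, 1000, 100, 1000),
    prevalence_ratio(60, 1200, 110, 1100),
]
pooled_pr, pooled_lo, pooled_hi, pooled_se = pool_fixed_effect(
    [(pr, se) for pr, _, _, se in strata]
)
print(round(pooled_pr, 3), passes_thresholds(pooled_pr, pooled_lo, pooled_hi))
```

Pooling on the log scale keeps the ratio symmetric, so a twofold underrepresentation (0.5) and a twofold overrepresentation (2.0) are treated as equally large fold differences.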
RESULTS
Our analysis reveals that EstBB is enriched for outpatient-managed, non-acute, and preventive care diagnoses, including dermatological, reproductive, endocrine, and mental health conditions. In contrast, severe and high-mortality diseases (such as dementia, stroke sequelae, advanced cancers, and chronic respiratory failure) are consistently underrepresented. Gender-specific trends indicate a higher cardiovascular burden and stronger overrepresentation of diagnoses in men, while women in EstBB more closely resemble the general population. The second recruitment wave (EstBB2), characterized by simplified procedures and broad outreach, represents a healthier subset with a lower prevalence of chronic disease and higher engagement in mental health and preventive care. Conversely, the first wave (EstBB1) represents a distinct subcohort with a higher disease burden, particularly among males.
CONCLUSION
EstBB is well-suited for genetic association studies, behavioral health research, and longitudinal tracking of chronic conditions. Its strengths include high-quality phenotype data and strong representation of traits with stable outpatient management. However, researchers must critically account for selection bias and demographic skew when modeling population-level disease burden or studying late-stage and high-mortality conditions. The accompanying dashboard enhances transparency and adaptability, allowing researchers to interrogate cohort composition and refine phenotype selection prior to analysis. This analysis supports more accurate interpretation of biobank-derived findings and strengthens the design of future studies using EstBB data.
Contact & Funding
Contact Information
Maarja Pajusalu
maarja.pajusalu@ut.ee
Collaborating Institutions
University of Tartu, Institute of Computer Science, Research Group of Health Informatics
Funding & Acknowledgments