Handling missing data in matched case-control studies using multiple imputation

Shaun Seaman, MRC Biostatistics Unit, Cambridge, UK
Ruth Keogh, Department of Medical Statistics, London School of Hygiene and Tropical Medicine
International Biometric Conference 2016, Victoria, Canada

Outline
1. Matched case-control studies
2. Motivating example: matched case-control study of fibre intake and colorectal cancer
3. Previous methods for handling missing data in matched case-control studies
4. Two methods using MI
5. Simulations
   - MI using matching variables
   - MI using matched sets
6. Illustration in motivating example
7. Concluding remarks

Matched case-control studies

Matched case-control studies
- Used to investigate associations between disease and putative risk factors
- Each case is individually matched to M controls based on matching variables
- Matching is used to control for confounding at the design stage
- The study is formed of matched sets
Types of matching variables
1. Matching on simple variables: sex, age, smoking status
2. Matching on complex variables: family, GP practice, neighbourhood

Matched case-control studies: Data and notation

Set  Individual  D  X^cat           X^con
1    1           1  x^cat_{11}      x^con_{11}
1    2           0  x^cat_{12}      x^con_{12}
1    ...
1    M+1         0  x^cat_{1,M+1}   x^con_{1,M+1}
2    1           1  x^cat_{21}      x^con_{21}
2    2           0  x^cat_{22}      x^con_{22}
2    ...
2    M+1         0  x^cat_{2,M+1}   x^con_{2,M+1}
3    1           1  x^cat_{31}      x^con_{31}
3    2           0  x^cat_{32}      x^con_{32}
3    ...
3    M+1         0  x^cat_{3,M+1}   x^con_{3,M+1}

- More generally we allow vector covariates: $X^{\mathrm{cat}}, X^{\mathrm{con}}$
- The matching variables are denoted $S$

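Matched case-control studies: Analysis
Logistic regression model:
$$\Pr(D = 1 \mid X^{\mathrm{cat}}, X^{\mathrm{con}}, S) = \frac{\exp\{\beta_{\mathrm{cat}}^T X^{\mathrm{cat}} + \beta_{\mathrm{con}}^T X^{\mathrm{con}} + q(S)\}}{1 + \exp\{\beta_{\mathrm{cat}}^T X^{\mathrm{cat}} + \beta_{\mathrm{con}}^T X^{\mathrm{con}} + q(S)\}}$$
Conditional logistic regression: matched set i, shown below, contributes the following term to the conditional likelihood:

Set  Individual  D  X^cat           X^con
i    1           1  x^cat_{i1}      x^con_{i1}
i    2           0  x^cat_{i2}      x^con_{i2}
i    ...
i    M+1         0  x^cat_{i,M+1}   x^con_{i,M+1}

$$\frac{\exp\{\beta_{\mathrm{cat}}^T x^{\mathrm{cat}}_{i1} + \beta_{\mathrm{con}}^T x^{\mathrm{con}}_{i1}\}}{\sum_{j=1}^{M+1} \exp\{\beta_{\mathrm{cat}}^T x^{\mathrm{cat}}_{ij} + \beta_{\mathrm{con}}^T x^{\mathrm{con}}_{ij}\}}$$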
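
The conditional likelihood contribution of a matched set can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names and the list-based data layout are assumptions made here. It treats the first individual of each set as the case, as in the notation above.

```python
import math

def set_log_likelihood(beta, case_x, control_xs):
    """Log-likelihood contribution of one matched set.

    beta       : list of log odds ratios, one per covariate
    case_x     : covariate vector of the case (individual j = 1)
    control_xs : list of covariate vectors of the M matched controls
    """
    def linear_predictor(x):
        return sum(b * v for b, v in zip(beta, x))

    scores = [linear_predictor(case_x)] + [linear_predictor(x) for x in control_xs]
    # log of exp(score_case) / sum_j exp(score_j), computed stably
    top = max(scores)
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[0] - log_denominator

def conditional_log_likelihood(beta, matched_sets):
    """Sum of set contributions; matched_sets is a list of (case_x, control_xs)."""
    return sum(set_log_likelihood(beta, case, controls) for case, controls in matched_sets)
```

The matching variables do not appear anywhere in the function: since q(S) takes the same value for every member of a set, it cancels between numerator and denominator.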

Matched case-control studies: Missing data

Set  Individual  D  X^cat           X^con
1    1           1  NA              NA
1    2           0  x^cat_{12}      x^con_{12}
1    ...
1    M+1         0  x^cat_{1,M+1}   x^con_{1,M+1}
2    1           1  x^cat_{21}      NA
2    2           0  x^cat_{22}      x^con_{22}
2    ...
2    M+1         0  x^cat_{2,M+1}   x^con_{2,M+1}
3    1           1  x^cat_{31}      x^con_{31}
3    2           0  NA              x^con_{32}
3    ...
3    M+1         0  x^cat_{3,M+1}   x^con_{3,M+1}

NA marks a missing value.

Motivating example
- Matched case-control study nested within EPIC-Norfolk to study the association between fibre intake and colorectal cancer
- Explanatory variables
  - Main exposure: fibre intake (g/day) from a 7-day diet diary
  - Categorical potential confounders: smoking status (3 categories), education (4 categories), social class (6 categories), physical activity (4 categories), aspirin use (2 categories)
  - Continuous potential confounders: height, weight, exact age, alcohol intake, folate intake, energy intake
- Each case matched to 4 controls on sex, age (within 3 months) and date of diary completion (within 3 months)

Motivating example: Missing data
- 318 cases, 1272 matched controls
- 328 individuals (20%) missing one or more adjustment variables
- Complete case analysis uses only 240 matched sets: only 75% of matched sets and 64% of individuals

Previous methods for handling missing data in matched case-control studies
Lipsitz et al (1998); Paik and Sacco (2000); Satten and Carroll (2000); Rathouz et al (2002); Rathouz (2003); Paik (2004); Sinha et al (2005); Sinha and Wang (2009); Gebregziabher and DeSantis (2010); Ahn et al (2011); Liu et al (2013)

Limitations of previous methods
- Assume only one partially observed covariate
- Assume partially observed covariates are collectively observed or missing on each individual
- Require parametric modelling of the matching variables
- Require bespoke computer code

Multiple imputation for matched case-control studies

Overview of multiple imputation (MI)
1. Missing values are filled in by sampling values from some appropriate distribution
2. This is performed K times to produce K imputed data sets
3. The analysis model is fitted in each imputed data set
4. Parameter and variance estimates are combined using Rubin's rules
We assume data are missing at random (MAR)
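
Step 4 can be illustrated with a small sketch of Rubin's rules for a single parameter. The function name and argument layout are assumptions made for this example. The pooled estimate is the mean of the K estimates, and the total variance is the average within-imputation variance plus (1 + 1/K) times the between-imputation variance.

```python
import math

def rubins_rules(estimates, variances):
    """Pool K point estimates and their variances (squared standard errors)."""
    k = len(estimates)
    pooled = sum(estimates) / k                       # mean of the K estimates
    within = sum(variances) / k                       # average within-imputation variance
    between = sum((e - pooled) ** 2 for e in estimates) / (k - 1)
    total = within + (1 + 1 / k) * between            # total variance
    return pooled, math.sqrt(total)                   # pooled estimate and its SE
```

For example, `rubins_rules([0.1, 0.2, 0.3], [0.01, 0.01, 0.01])` pools three imputed-data estimates that each have standard error 0.1; the pooled standard error exceeds 0.1 because the estimates disagree across imputations.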

Advantages of using MI
- Many researchers familiar with the technique
- MI software readily available and easy to use
- Allows for multiple partially observed covariates without needing them to be collectively observed or missing
- Can incorporate information on auxiliary variables
- Reduces to conditional logistic regression when there are no missing data

Joint model MI versus full conditional specification (FCS) MI
Joint model MI
- A Bayesian model is specified for the distribution of the partially observed variables given the fully observed variables: $X^{\mathrm{cat}}, X^{\mathrm{con}} \mid D, S$
- Values for missing variables are sampled from their joint posterior predictive distribution
FCS MI
- A model is specified for the distribution of each partially missing variable conditional on all other variables: $X^{\mathrm{cat},k} \mid X^{\mathrm{cat},-k}, X^{\mathrm{con}}, D, S$
- The FCS algorithm cycles through the imputation models until convergence is achieved
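
The cycling step of FCS can be sketched for the simplest possible case: two continuous variables, each imputed by linear regression on the other. This is a deliberately simplified illustration, not the procedure used in the talk. The function names are hypothetical, there are no categorical variables, D or S, and the regression parameters are fixed at their least-squares estimates rather than drawn from their posterior (so the sketch understates imputation uncertainty).

```python
import random

def fit_simple_regression(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (n - 2)) ** 0.5

def fcs_impute(rows, n_cycles=10, seed=0):
    """rows: list of [u, v] pairs with None marking a missing value."""
    rng = random.Random(seed)
    missing = [[value is None for value in row] for row in rows]
    data = [list(row) for row in rows]  # work on a copy
    # Starting values: fill each gap with the observed mean of its column.
    for col in (0, 1):
        observed = [r[col] for r in rows if r[col] is not None]
        for r in data:
            if r[col] is None:
                r[col] = sum(observed) / len(observed)
    for _ in range(n_cycles):
        for target, predictor in ((0, 1), (1, 0)):
            # Fit the conditional model on rows where the target was observed.
            fit_rows = [r for r, m in zip(data, missing) if not m[target]]
            a, b, sd = fit_simple_regression(
                [r[predictor] for r in fit_rows], [r[target] for r in fit_rows])
            # Redraw every missing target value from the fitted model.
            # Simplification: (a, b, sd) are point estimates, not posterior draws.
            for r, m in zip(data, missing):
                if m[target]:
                    r[target] = a + b * r[predictor] + rng.gauss(0.0, sd)
    return data
```

Running the function K times with different seeds would give K imputed data sets. Packages such as mice additionally draw the model parameters before each imputation and support categorical targets.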

Compatibility in MI
Imputation model: $X^{\mathrm{cat}}, X^{\mathrm{con}} \mid D, S$
Analysis model: conditional logistic regression
$$\frac{\exp\{\beta_{\mathrm{cat}}^T x^{\mathrm{cat}}_{i1} + \beta_{\mathrm{con}}^T x^{\mathrm{con}}_{i1}\}}{\sum_{j=1}^{M+1} \exp\{\beta_{\mathrm{cat}}^T x^{\mathrm{cat}}_{ij} + \beta_{\mathrm{con}}^T x^{\mathrm{con}}_{ij}\}}$$
Compatibility
- The imputation model and the analysis model are compatible if there exists a joint model for all variables which implies the imputation model and the analysis model as submodels
- If the joint model and the analysis model are compatible, and the data are MAR, joint model MI gives consistent parameter and variance estimates

Compatibility in MI
Joint model MI: $X^{\mathrm{cat}}, X^{\mathrm{con}} \mid D, S$
FCS MI: $X^{\mathrm{cat},k} \mid X^{\mathrm{cat},-k}, X^{\mathrm{con}}, D, S$
Result of Liu et al 2014:
- The set of conditional models, $\{M_k\}$, is compatible with a joint model, $M_{\mathrm{joint}}$, if for each $M_k$ and every possible set of parameter values for that model, there exists a set of parameter values for the joint model $M_{\mathrm{joint}}$ such that $M_k$ and $M_{\mathrm{joint}}$ imply the same distribution for the dependent variable of $M_k$
- If this holds, the distribution of imputed data from FCS MI converges asymptotically to the posterior predictive distribution of the missing data under joint model MI

MI for matched case-control studies
1. MI using matching variables
2. MI using matched set

MI using matching variables
Basis for MI using matching variables
- Multiply impute $X^{\mathrm{cat}}$ and $X^{\mathrm{con}}$ from their conditional distribution given $D, S$
- We outline 3 ways of modelling the distribution of $X^{\mathrm{cat}}, X^{\mathrm{con}} \mid D, S$
- The matching between cases and controls is broken at the imputation stage
- But the matching is restored at the analysis stage and conditional logistic regression is applied to each imputed data set

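MI using matching variables: Method 1
Model for categorical variables:
$$\Pr(X^{\mathrm{cat}} = x^{\mathrm{cat}} \mid S, D) = \frac{\exp\{\gamma_0^T x^{\mathrm{cat}} + x^{\mathrm{cat}\,T} \gamma_{\mathrm{cat}} x^{\mathrm{cat}} + x^{\mathrm{cat}\,T} \gamma_S S + x^{\mathrm{cat}\,T} \gamma_D D\}}{\sum_{\tilde{x}} \exp\{\gamma_0^T \tilde{x} + \tilde{x}^T \gamma_{\mathrm{cat}} \tilde{x} + \tilde{x}^T \gamma_S S + \tilde{x}^T \gamma_D D\}}$$
where the sum is over all possible values $\tilde{x}$ of $X^{\mathrm{cat}}$.
Model for continuous variables:
$$X^{\mathrm{con}} \mid X^{\mathrm{cat}}, S, D \sim N(\alpha + \phi D + \gamma X^{\mathrm{cat}} + \delta S, \Sigma)$$
We have shown that this model is compatible with the analysis model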

MI using matching variables: Method 1
- Bayesian modelling software can be used to impute missing $X^{\mathrm{cat}}$ and $X^{\mathrm{con}}$ from the posterior predictive distribution implied by the joint model above
FCS MI
- Uses a set of fully conditional models which is compatible with the joint model
- $X^{\mathrm{con},k}$: linear regression on $X^{\mathrm{cat}}, X^{\mathrm{con},-k}, D, S$
- $X^{\mathrm{cat},k}$: multinomial logistic regression on $X^{\mathrm{cat},-k}, X^{\mathrm{con}}, D, S$
- These are the default options in many MI packages

MI using matching variables: Method 2
Uses a latent normal model
- $W^{\mathrm{cat}}$: set of latent variables, one for each element of $X^{\mathrm{cat}}$
- $X^{\mathrm{cat},k} = 1$ if $W^{\mathrm{cat},k} > 0$
Latent normal model MI:
$$X^{\mathrm{con}}, W^{\mathrm{cat}} \mid S, D \sim N(\alpha + \phi D + \delta S, \Sigma)$$
Implementation
- jomo package in R
- REALCOM-MI
- realcomimpute: interface between Stata and REALCOM-MI

MI using matching variables: Method 3
Method 2: Latent normal model MI
$$X^{\mathrm{con}}, W^{\mathrm{cat}} \mid S, D \sim N(\alpha + \phi D + \delta S, \Sigma)$$
Method 3: Normal model MI
$$X^{\mathrm{con}}, X^{\mathrm{cat}} \mid S, D \sim N(\alpha + \phi D + \delta S, \Sigma)$$
- Imputed values of $X^{\mathrm{cat}}$ which are non-integer are handled using adaptive rounding
Implementation
- norm package in R
- mi impute mvn in Stata
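
The slides do not spell out the rounding rule. As an assumption, the sketch below uses the adaptive rounding threshold of Bernaards, Belin and Schafer (2007) for a binary variable, which replaces the fixed cut-off of 0.5 with one derived from a normal approximation to the binomial distribution. The function name is hypothetical, and the sketch handles one binary variable in one imputed data set.

```python
from statistics import NormalDist

def adaptive_round(imputed_values):
    """Round continuous imputations of a 0/1 variable to 0 or 1.

    imputed_values: values of the variable in one imputed data set
    (assumed here to include both observed and imputed entries).
    """
    mean = sum(imputed_values) / len(imputed_values)
    if not 0.0 < mean < 1.0:
        raise ValueError("threshold is only defined for a mean strictly between 0 and 1")
    # Threshold from a normal approximation to the binomial distribution.
    threshold = mean - NormalDist().inv_cdf(mean) * (mean * (1.0 - mean)) ** 0.5
    return [1 if value >= threshold else 0 for value in imputed_values]
```

When the mean is 0.5 the threshold is 0.5 and the rule coincides with simple rounding. For a rarer category the threshold moves above 0.5, so fewer imputed values are rounded up to 1.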

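MI using matching variables
Method 1: FCS MI
- $X^{\mathrm{con},k}$: linear regression on $X^{\mathrm{cat}}, X^{\mathrm{con},-k}, D, S$
- $X^{\mathrm{cat},k}$: multinomial logistic regression on $X^{\mathrm{cat},-k}, X^{\mathrm{con}}, D, S$
$$\Pr(X^{\mathrm{cat}} = x^{\mathrm{cat}} \mid S, D) = \frac{\exp\{\gamma_0^T x^{\mathrm{cat}} + x^{\mathrm{cat}\,T} \gamma_{\mathrm{cat}} x^{\mathrm{cat}} + x^{\mathrm{cat}\,T} \gamma_S S + x^{\mathrm{cat}\,T} \gamma_D D\}}{\sum_{\tilde{x}} \exp\{\gamma_0^T \tilde{x} + \tilde{x}^T \gamma_{\mathrm{cat}} \tilde{x} + \tilde{x}^T \gamma_S S + \tilde{x}^T \gamma_D D\}}$$
$$X^{\mathrm{con}} \mid X^{\mathrm{cat}}, S, D \sim N(\alpha + \phi D + \gamma X^{\mathrm{cat}} + \delta S, \Sigma)$$
Method 2: Latent normal model MI
$$X^{\mathrm{con}}, W^{\mathrm{cat}} \mid S, D \sim N(\alpha + \phi D + \delta S, \Sigma)$$
Method 3: Normal model MI
$$X^{\mathrm{con}}, X^{\mathrm{cat}} \mid S, D \sim N(\alpha + \phi D + \delta S, \Sigma)$$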

MI using matched set
Basis for MI using matched set
- Multiply impute based on a model for $X^{\mathrm{set}} = (X_1^{\mathrm{cat}}, X_1^{\mathrm{con}}, X_2^{\mathrm{cat}}, X_2^{\mathrm{con}}, \ldots, X_{M+1}^{\mathrm{cat}}, X_{M+1}^{\mathrm{con}})$
- The imputation does not use the matching variables $S$

Set  Individual  D  X^cat           X^con
i    1           1  x^cat_{i1}      x^con_{i1}
i    2           0  x^cat_{i2}      x^con_{i2}
i    ...
i    M+1         0  x^cat_{i,M+1}   x^con_{i,M+1}

- We outline 3 ways of modelling the distribution of $X^{\mathrm{set}}$
- The matching between cases and controls is retained at both the imputation stage and the analysis stage

MI using matching variables vs MI using matched set
Basis for MI using matching variables: $X^{\mathrm{cat}}, X^{\mathrm{con}} \mid D, S$
Basis for MI using matched set: $X^{\mathrm{set}} = (X_1^{\mathrm{cat}}, X_1^{\mathrm{con}}, X_2^{\mathrm{cat}}, X_2^{\mathrm{con}}, \ldots, X_{M+1}^{\mathrm{cat}}, X_{M+1}^{\mathrm{con}})$
Why use MI using matched set?
- It may not be feasible/desired to specify the effect of matching variables $S$
- The analyst may not have information on $S$
- The analysis model does not model the effect of $S$

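MI using matched set: Method 1
Model for categorical variables:
$$\Pr(X_1^{\mathrm{cat}}, \ldots, X_{M+1}^{\mathrm{cat}}) \propto \exp\Big\{ \sum_{j=1}^{M+1} \gamma_1^T X_j^{\mathrm{cat}} + \sum_{j < j'} X_j^{\mathrm{cat}\,T} \gamma_2 X_{j'}^{\mathrm{cat}} + \tau^T X_1^{\mathrm{cat}} \Big\}$$
Model for continuous variables:
$$X_j^{\mathrm{con}} \mid X_1^{\mathrm{cat}}, \ldots, X_{M+1}^{\mathrm{cat}}, u \sim N\Big(\eta + \xi I(j = 1) + \rho X_j^{\mathrm{cat}} + \psi \sum_{j' \neq j} X_{j'}^{\mathrm{cat}} + u, \Lambda\Big)$$
We have shown that this model is compatible with the analysis model
FCS MI
- $X_j^{\mathrm{con},k}$: linear regression on $X_j^{\mathrm{cat}}, X_j^{\mathrm{con},-k}, \sum_{j' \neq j} X_{j'}^{\mathrm{cat}}, \sum_{j' \neq j} X_{j'}^{\mathrm{con}}$
- $X_j^{\mathrm{cat},k}$: multinomial logistic regression on $X_j^{\mathrm{con}}, X_j^{\mathrm{cat},-k}, \sum_{j' \neq j} X_{j'}^{\mathrm{cat}}, \sum_{j' \neq j} X_{j'}^{\mathrm{con}}$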

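MI using matched set: Method 1
FCS MI
- $X_j^{\mathrm{con},k}$: linear regression on $X_j^{\mathrm{cat}}, X_j^{\mathrm{con},-k}, \sum_{j' \neq j} X_{j'}^{\mathrm{cat}}, \sum_{j' \neq j} X_{j'}^{\mathrm{con}}$
- $X_j^{\mathrm{cat},k}$: multinomial logistic regression on $X_j^{\mathrm{con}}, X_j^{\mathrm{cat},-k}, \sum_{j' \neq j} X_{j'}^{\mathrm{cat}}, \sum_{j' \neq j} X_{j'}^{\mathrm{con}}$

Data for set i:

Set  Individual  D  X^cat           X^con
i    1           1  x^cat_{i1}      x^con_{i1}
i    2           0  x^cat_{i2}      x^con_{i2}
i    ...
i    M+1         0  x^cat_{i,M+1}   x^con_{i,M+1}

Variables used when imputing for individual j in set i:

Set  j  X^cat_j      X^con_j      sum_{j'≠j} X^cat_{j'}       sum_{j'≠j} X^con_{j'}
i    1  x^cat_{i1}   x^con_{i1}   sum_{j'≠1} x^cat_{ij'}      sum_{j'≠1} x^con_{ij'}
i    2  x^cat_{i2}   x^con_{i2}   sum_{j'≠2} x^cat_{ij'}      sum_{j'≠2} x^con_{ij'}

Implementation: e.g. using mice in R, mi impute in Stata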
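
The extra predictors in this layout, the sums over the other members of the matched set, can be derived from data in long format (one record per individual). The sketch below is illustrative only: the function name, the "set" key and the "others_" prefix are assumptions made here, and it assumes the values are currently complete, as they are within an FCS cycle once missing values hold their current imputations.

```python
def add_other_member_sums(records, variables):
    """records: list of dicts with a 'set' key and one numeric key per variable."""
    # First pass: total of each variable within each matched set.
    totals = {}
    for rec in records:
        set_totals = totals.setdefault(rec["set"], {v: 0.0 for v in variables})
        for v in variables:
            set_totals[v] += rec[v]
    # Second pass: sum over the other members = set total minus own value.
    augmented = []
    for rec in records:
        row = dict(rec)  # leave the input records untouched
        for v in variables:
            row["others_" + v] = totals[rec["set"]][v] - rec[v]
        augmented.append(row)
    return augmented
```

In a full FCS run these sums would need to be recomputed whenever a set member's imputed value changes, since each individual's predictors depend on the other members' current values.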

MI using matched set: Methods 2 and 3
Method 2: Latent normal model MI
$$X_j^{\mathrm{con}}, W_j^{\mathrm{cat}} \mid D_1 = 1, D_2 = \cdots = D_{M+1} = 0, u \sim N(\alpha + \phi D_j + u, \Sigma)$$
Method 3: Normal model MI
$$X_j^{\mathrm{con}}, X_j^{\mathrm{cat}} \mid D_1 = 1, D_2 = \cdots = D_{M+1} = 0, u \sim N(\alpha + \phi D_j + u, \Sigma)$$
Implementation
- Latent normal model MI: jomo in R, REALCOM-MI
- Normal model MI: pan in R

Simulation study

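Simulation study
- Two matching variables: $S^{\mathrm{cat}}$, $S^{\mathrm{con}}$
  $\Pr(S^{\mathrm{cat}} = 1 \mid D = 1) = 0.6$, $S^{\mathrm{con}} \mid S^{\mathrm{cat}}, D = 1 \sim N(0, 1)$
- Three covariates: $X^{\mathrm{cat}}$, $X^{\mathrm{conA}}$, $X^{\mathrm{conB}}$
  $\mathrm{logit} \Pr(X^{\mathrm{cat}} \mid S^{\mathrm{cat}}, S^{\mathrm{con}}, D) = -2.5 + 0.5 S^{\mathrm{cat}} + 0.5 S^{\mathrm{con}} + 0.75 D$
  $X^{\mathrm{conA}} \mid X^{\mathrm{cat}}, S^{\mathrm{cat}}, S^{\mathrm{con}}, D \sim N(0.5 X^{\mathrm{cat}} + 0.5 S^{\mathrm{cat}} + 0.5 S^{\mathrm{con}} + 0.5 D, 1)$
- True log ORs: $\beta_{\mathrm{cat}} = 5/12$, $\beta_{\mathrm{conA}} = \beta_{\mathrm{conB}} = 1/3$
- 100 or 500 matched sets
- 1 control or 4 controls per case
- 10% or 25% missing data in $X^{\mathrm{cat}}$, $X^{\mathrm{conA}}$, $X^{\mathrm{conB}}$
- MCAR or MAR
- 1000 simulations, 50 imputations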

Simulation study results

                               X^cat                    X^conA
                               LOR    SE     estSE      LOR    SE     estSE
Complete data                  0.426  0.213  0.206      0.336  0.078  0.082
Complete cases                 0.449  0.379  0.377      0.341  0.144  0.149
MI using matching variables
  Method 1: FCS                0.431  0.240  0.241      0.336  0.090  0.096
  Method 2: Latent normal      0.446  0.238  0.241      0.322  0.085  0.095
  Method 3: Normal             0.386  0.215  0.235      0.338  0.090  0.095
MI using matched set
  Method 1: FCS                0.430  0.247  0.243      0.335  0.094  0.097
  Method 2: Latent normal      0.455  0.249  0.247      0.300  0.085  0.095
  Method 3: Normal             0.407  0.238  0.251      0.350  0.098  0.101

LOR = mean estimated log OR; SE = empirical standard error; estSE = mean estimated standard error

Overview of simulation results
- All MI methods appear to work well
- MI using matching variables more efficient than MI using matched set
- FCS MI (Method 1) nearly always gave the least biased estimates
- MI using matching variables: latent normal MI and normal MI more efficient than FCS MI
- MI using matched set:
  - FCS MI slightly better than latent normal and normal MI when 4:1 matching
  - no method obviously best or worst when 1:1 matching

Illustration

Motivating example: results

Method                         LOR    SE     p-value
Complete cases                 0.196  0.126  0.121
MI using matching variables
  Method 1: FCS                0.176  0.104  0.090
  Method 2: Latent normal      0.176  0.104  0.089
  Method 3: Normal             0.177  0.104  0.088
MI using matched set
  Method 1: FCS                0.175  0.104  0.092
  Method 2: Latent normal      0.174  0.104  0.094
  Method 3: Normal             0.181  0.104  0.082

Log odds ratio is for a six-gram per day increase in fibre intake, conditional on the confounders

Conclusions
- MI is a simple and versatile solution to the problem of missing data in matched case-control studies
- Proposed two overall approaches: MI using matched set, MI using matching variables
- Three sub-methods: FCS MI, latent normal MI, normal MI
  - FCS MI uses an imputation model that is compatible with the analysis model
  - The other methods use imputation models that are incompatible with the analysis model; these use joint model MI
- All methods can be applied in standard software

Reference
Seaman, SR and Keogh, RH. Handling missing data in matched case-control studies using multiple imputation. Biometrics 2015; 71(4): 1150-1159.