Missing data analysis

Author

Raphael Saldanha

Last modification

February 19, 2024 | 10:02:22 +01:00

This report presents a missing data analysis from the raw parquet files.

Considering the high number of variables, this report will use a subset of the most relevant ones for this research.

Packages

library(tidyverse)
library(arrow)
library(naniar)
library(knitr)
library(lubridate)
source("../functions.R")

Execution node

node_name()
[1] "rfsaldanha"

Load data

important_vars <- c("ID_AGRAVO", "DT_NOTIFIC", "ID_UNIDADE",
                    "DT_SIN_PRI", "CS_SEXO", "CS_GESTANT",
                    "CS_RACA", "CS_ESCOL_N", "ID_MN_RESI",
                    "COUFINF", "COMUNINF", "ID_OCUPA_N",
                    "DT_SORO", "RESUL_SORO", "SOROTIPO", 
                    "CLASSI_FIN", "CRITERIO", "EVOLUCAO",
                    "DT_OBITO", "HOSPITALIZ", "DT_INTERNA")

dengue_files_list <- c(
  data_dir("dengue_data/parquets/dengue_2011.parquet"),
  data_dir("dengue_data/parquets/dengue_2012.parquet"),
  data_dir("dengue_data/parquets/dengue_2013.parquet"),
  data_dir("dengue_data/parquets/dengue_2014.parquet"),
  data_dir("dengue_data/parquets/dengue_2015.parquet"),
  data_dir("dengue_data/parquets/dengue_2016.parquet"),
  data_dir("dengue_data/parquets/dengue_2017.parquet"),
  data_dir("dengue_data/parquets/dengue_2018.parquet"),
  data_dir("dengue_data/parquets/dengue_2019.parquet"),
  data_dir("dengue_data/parquets/dengue_2020.parquet"),
  data_dir("dengue_data/parquets/dengue_2021.parquet"),
  data_dir("dengue_data/parquets/dengue_2022.parquet")
)

dengue <- open_dataset(sources = dengue_files_list) %>%
  select(all_of(important_vars)) %>%
  collect()

Overall

Considering all records.

dengue %>% 
  miss_var_summary() %>%
  kable(
    format.args = list(big.mark = ".", decimal.mark = ",")
  )
variable        n_miss    pct_miss
DT_OBITO    16.931.460  99,8552666
SOROTIPO    16.882.644  99,5673685
DT_INTERNA  16.483.024  97,2105628
DT_SORO     12.097.609  71,3470647
COMUNINF    10.641.748  62,7609541
COUFINF     10.610.594  62,5772197
HOSPITALIZ   6.615.736  39,0170772
RESUL_SORO   5.276.629  31,1195370
CS_ESCOL_N   3.613.638  21,3118530
EVOLUCAO     3.242.780  19,1246745
CRITERIO     2.129.285  12,5577074
CS_RACA      1.291.206   7,6150385
ID_OCUPA_N      94.322   0,5562750
CLASSI_FIN      71.348   0,4207832
CS_GESTANT       4.694   0,0276834
ID_MN_RESI       2.319   0,0136766
ID_UNIDADE       1.817   0,0107160
CS_SEXO            777   0,0045824
DT_SIN_PRI         224   0,0013211
ID_AGRAVO            0   0,0000000
DT_NOTIFIC           0   0,0000000

Variables quality

Residence municipality: ID_MN_RESI

Check the variable length: 6 characters are expected.

dengue %>%
  mutate(
    ID_MN_RESI_check = nchar(ID_MN_RESI) == 6
  ) %>%
  group_by(ID_MN_RESI_check) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  kable(
    format.args = list(big.mark = ".", decimal.mark = ",")
  )
ID_MN_RESI_check        freq
FALSE                      4
TRUE              16.953.678
NA                     2.319

16.953.678 records in the database meet the criterion. 4 records have invalid municipality codes, and this information is missing for 2.319 records. Imputation from COMUNINF is a possibility.
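The imputation idea can be sketched as follows. This is only an illustration on made-up toy data, not the procedure used in this research: the helper `is_valid_code()` and the columns `ID_MN_RESI_imp` and `ID_MN_RESI_was_imputed` are hypothetical names, and the sketch assumes that the probable municipality of infection (COMUNINF) is an acceptable stand-in for the municipality of residence. Since COMUNINF is itself missing for about 63% of the records, the coverage of such an imputation would be limited.

```r
library(dplyr)

# Toy data standing in for the real `dengue` table (values are made up).
toy <- tibble(
  ID_MN_RESI = c("330455", NA, "3304", NA),
  COMUNINF   = c("330455", "355030", "330455", NA)
)

# Hypothetical helper: a code is usable only if present and 6 characters long.
is_valid_code <- function(x) !is.na(x) & nchar(x) == 6

toy_imputed <- toy %>%
  mutate(
    # Flag the records where the residence code is unusable but COMUNINF is usable.
    ID_MN_RESI_was_imputed = !is_valid_code(ID_MN_RESI) & is_valid_code(COMUNINF),
    # Take COMUNINF for flagged records; keep the original value otherwise.
    ID_MN_RESI_imp = if_else(ID_MN_RESI_was_imputed, COMUNINF, ID_MN_RESI)
  )

toy_imputed
```

Keeping the flag column makes it possible to run sensitivity analyses with and without the imputed records.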

Final classification of the notification: CLASSI_FIN

The final classification of each notification is expected to be labeled.

dengue %>%
  group_by(CLASSI_FIN) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  kable(
    format.args = list(big.mark = ".", decimal.mark = ",")
  )
CLASSI_FIN                         freq
6                                     3
Dengue                        6.888.189
Dengue clássico               2.346.272
Dengue com complicações          18.969
Dengue com sinais de alarme     101.062
Dengue grave                      8.931
Descartado                    5.364.096
Febre hemorrágica do dengue       5.307
Inconclusivo                  2.151.527
Síndrome do choque do dengue        297
NA                               71.348

3 records have an invalid label ("6") and 71.348 are missing. The missing data may have two causes: (1) the notification is still being evaluated, or (2) the value is genuinely missing.
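One way to explore which of the two causes dominates is to tabulate the missingness of CLASSI_FIN by notification year: if missing values concentrate in the most recent years, pending evaluation is a plausible explanation. The sketch below uses made-up toy data and assumes that DT_NOTIFIC can be parsed with `ymd()`; the column names `notif_year`, `n_miss` and `pct_miss` are illustrative choices.

```r
library(dplyr)
library(lubridate)

# Toy data standing in for the real `dengue` table (values are made up).
toy <- tibble(
  DT_NOTIFIC = c("2011-03-01", "2011-07-15", "2022-11-20", "2022-12-05"),
  CLASSI_FIN = c("Dengue", NA, NA, NA)
)

# Count and share of missing final classifications per notification year.
miss_by_year <- toy %>%
  mutate(notif_year = year(ymd(DT_NOTIFIC))) %>%
  group_by(notif_year) %>%
  summarise(
    n = n(),
    n_miss = sum(is.na(CLASSI_FIN)),
    pct_miss = 100 * mean(is.na(CLASSI_FIN))
  ) %>%
  ungroup()

miss_by_year
```

This does not prove the cause of a missing value; it only shows whether the pattern is compatible with notifications still under evaluation.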

Date of first symptom onset: DT_SIN_PRI

This is the date closest to the infection date and the most relevant for epidemiological analysis.

valid_interval <- interval(ymd("2001-01-01"), ymd("2022-12-31"))

dengue %>%
  mutate(
    DT_SIN_PRI_check = ymd(DT_SIN_PRI) %within% valid_interval
  ) %>%
  group_by(DT_SIN_PRI_check) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  kable(
    format.args = list(big.mark = ".", decimal.mark = ",")
  )
DT_SIN_PRI_check        freq
FALSE                 25.826
TRUE              16.929.951
NA                       224

This variable is missing for 224 records. For 25.826 records, the dates are invalid (outside the period 2001-2022). It is possible to impute them with DT_NOTIFIC (date of notification), if valid.
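A minimal sketch of this imputation rule, on made-up toy data, could look like the code below. The column names `onset`, `notif`, `onset_ok`, `notif_ok` and `DT_SIN_PRI_imp` are hypothetical, and the sketch assumes both date variables can be parsed with `ymd()`. Note that the notification date usually follows symptom onset, so imputed dates will tend to be later than the true onset.

```r
library(dplyr)
library(lubridate)

valid_interval <- interval(ymd("2001-01-01"), ymd("2022-12-31"))

# Toy data standing in for the real `dengue` table (values are made up).
toy <- tibble(
  DT_SIN_PRI = c("2015-02-10", NA, "1915-02-10", "2030-01-01"),
  DT_NOTIFIC = c("2015-02-12", "2016-03-01", "2015-02-14", "2031-01-01")
)

toy_imputed <- toy %>%
  mutate(
    onset = ymd(DT_SIN_PRI),
    notif = ymd(DT_NOTIFIC),
    # A date is usable only if it is present and inside the valid interval.
    onset_ok = !is.na(onset) & onset %within% valid_interval,
    notif_ok = !is.na(notif) & notif %within% valid_interval,
    # Prefer the onset date; fall back to a valid notification date; else NA.
    DT_SIN_PRI_imp = case_when(
      onset_ok ~ onset,
      notif_ok ~ notif,
      TRUE ~ as.Date(NA)
    )
  )

toy_imputed
```

Records where neither date is valid remain missing rather than receiving an arbitrary value.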

Session info

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] knitr_1.45      naniar_1.0.0    arrow_14.0.0.2  lubridate_1.9.3
 [5] forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2    
 [9] readr_2.1.5     tidyr_1.3.1     tibble_3.2.1    ggplot2_3.4.4  
[13] tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] bit_4.0.5         gtable_0.3.4      jsonlite_1.8.8    compiler_4.3.2   
 [5] visdat_0.6.0      tidyselect_1.2.0  assertthat_0.2.1  scales_1.3.0     
 [9] yaml_2.3.8        fastmap_1.1.1     R6_2.5.1          generics_0.1.3   
[13] htmlwidgets_1.6.4 munsell_0.5.0     pillar_1.9.0      tzdb_0.4.0       
[17] rlang_1.1.3       utf8_1.2.4        stringi_1.8.3     xfun_0.42        
[21] bit64_4.0.5       timechange_0.3.0  cli_3.6.2         withr_3.0.0      
[25] magrittr_2.0.3    digest_0.6.34     grid_4.3.2        hms_1.1.3        
[29] lifecycle_1.0.4   vctrs_0.6.5       evaluate_0.23     glue_1.7.0       
[33] fansi_1.0.6       colorspace_2.1-0  rmarkdown_2.25    tools_4.3.2      
[37] pkgconfig_2.0.3   htmltools_0.5.7  