Missing data analysis
This report presents a missing data analysis of the raw parquet files.
Because the dataset has a large number of variables, the report uses a subset of those most relevant to this research.
Packages
library(tidyverse)
library(arrow)
library(naniar)
library(knitr)
library(lubridate)
source("../functions.R")
Execution node
node_name()
[1] "rfsaldanha"
Load data
# Subset of variables analysed in this report
important_vars <- c("ID_AGRAVO", "DT_NOTIFIC", "ID_UNIDADE",
                    "DT_SIN_PRI", "CS_SEXO", "CS_GESTANT",
                    "CS_RACA", "CS_ESCOL_N", "ID_MN_RESI",
                    "COUFINF", "COMUNINF", "ID_OCUPA_N",
                    "DT_SORO", "RESUL_SORO", "SOROTIPO",
                    "CLASSI_FIN", "CRITERIO", "EVOLUCAO",
                    "DT_OBITO", "HOSPITALIZ", "DT_INTERNA")
# One parquet file per notification year, 2011 to 2022
dengue_files_list <- c(
  data_dir("dengue_data/parquets/dengue_2011.parquet"),
  data_dir("dengue_data/parquets/dengue_2012.parquet"),
  data_dir("dengue_data/parquets/dengue_2013.parquet"),
  data_dir("dengue_data/parquets/dengue_2014.parquet"),
  data_dir("dengue_data/parquets/dengue_2015.parquet"),
  data_dir("dengue_data/parquets/dengue_2016.parquet"),
  data_dir("dengue_data/parquets/dengue_2017.parquet"),
  data_dir("dengue_data/parquets/dengue_2018.parquet"),
  data_dir("dengue_data/parquets/dengue_2019.parquet"),
  data_dir("dengue_data/parquets/dengue_2020.parquet"),
  data_dir("dengue_data/parquets/dengue_2021.parquet"),
  data_dir("dengue_data/parquets/dengue_2022.parquet")
)
# Select the columns before collect() so only they are read into memory
dengue <- open_dataset(sources = dengue_files_list) %>%
  select(all_of(important_vars)) %>%
  collect()
Overall
Considering all records.
dengue %>%
  miss_var_summary() %>%
  kable(
    format.args = list(big.mark = ".", decimal.mark = ",")
  )
variable | n_miss | pct_miss |
---|---|---|
DT_OBITO | 16.931.460 | 99,8552666 |
SOROTIPO | 16.882.644 | 99,5673685 |
DT_INTERNA | 16.483.024 | 97,2105628 |
DT_SORO | 12.097.609 | 71,3470647 |
COMUNINF | 10.641.748 | 62,7609541 |
COUFINF | 10.610.594 | 62,5772197 |
HOSPITALIZ | 6.615.736 | 39,0170772 |
RESUL_SORO | 5.276.629 | 31,1195370 |
CS_ESCOL_N | 3.613.638 | 21,3118530 |
EVOLUCAO | 3.242.780 | 19,1246745 |
CRITERIO | 2.129.285 | 12,5577074 |
CS_RACA | 1.291.206 | 7,6150385 |
ID_OCUPA_N | 94.322 | 0,5562750 |
CLASSI_FIN | 71.348 | 0,4207832 |
CS_GESTANT | 4.694 | 0,0276834 |
ID_MN_RESI | 2.319 | 0,0136766 |
ID_UNIDADE | 1.817 | 0,0107160 |
CS_SEXO | 777 | 0,0045824 |
DT_SIN_PRI | 224 | 0,0013211 |
ID_AGRAVO | 0 | 0,0000000 |
DT_NOTIFIC | 0 | 0,0000000 |
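For reference, the two columns in the table above can be reproduced by hand: n_miss counts the NA values of each variable and pct_miss expresses that count as a percentage of all rows. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not taken from the dengue files), intended only to show the arithmetic.

```r
library(dplyr)
library(tidyr)

# Toy data: hypothetical values, for illustration only
toy <- tibble(
  CS_SEXO  = c("M", "F", NA, "F"),
  DT_OBITO = c(NA, NA, NA, "2015-03-02")
)

# Count NA per column, reshape to one row per variable,
# then express the count as a percentage of all rows
manual_summary <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_miss") %>%
  mutate(pct_miss = 100 * n_miss / nrow(toy)) %>%
  arrange(desc(n_miss))

manual_summary
```

On the toy data, DT_OBITO has 3 missing values out of 4 rows (75%) and CS_SEXO has 1 (25%).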
Variables quality
Residence municipality: ID_MN_RESI
Check the length of the municipality code: 6 characters are expected.
dengue %>%
  mutate(
    # TRUE when the code has exactly 6 characters; NA when it is missing
    ID_MN_RESI_check = if_else(nchar(ID_MN_RESI) == 6,
                               true = TRUE,
                               false = FALSE)
  ) %>%
  group_by(ID_MN_RESI_check) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  kable(
    format.args = list(big.mark = ".", decimal.mark = ",")
  )
ID_MN_RESI_check | freq |
---|---|
FALSE | 4 |
TRUE | 16.953.678 |
NA | 2.319 |
16.953.678 records in the database meet the criterion. 4 records have invalid municipality codes, and the code is missing for 2.319 records. Imputation from COMUNINF is a possibility.
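One way such an imputation could look is sketched below. It assumes that COMUNINF uses the same 6-character coding as ID_MN_RESI and that the municipality of infection is an acceptable proxy for the municipality of residence; both assumptions would need to be verified before use. The toy data, the helper `is_valid_code()`, and the columns `ID_MN_RESI_imp` and `imputed` are hypothetical names introduced for illustration.

```r
library(dplyr)

# Toy data: hypothetical codes, for illustration only
toy <- tibble(
  ID_MN_RESI = c("330455", "3304", NA, NA),
  COMUNINF   = c("330455", "355030", "355030", NA)
)

# Hypothetical helper: a code is valid when present and 6 characters long
is_valid_code <- function(x) !is.na(x) & nchar(x) == 6

toy_imputed <- toy %>%
  mutate(
    # Keep a valid residence code; otherwise fall back to a valid COMUNINF
    ID_MN_RESI_imp = case_when(
      is_valid_code(ID_MN_RESI) ~ ID_MN_RESI,
      is_valid_code(COMUNINF)   ~ COMUNINF,
      TRUE                      ~ NA_character_
    ),
    # Flag imputed rows so they can be excluded in sensitivity checks
    imputed = !is_valid_code(ID_MN_RESI) & !is.na(ID_MN_RESI_imp)
  )

toy_imputed
```

Keeping the original column untouched and adding an explicit flag makes it possible to compare results with and without the imputed records.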
Final classification of the notification: CLASSI_FIN
Every notification is expected to carry a final classification label.
dengue %>%
  group_by(CLASSI_FIN) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  kable(
    format.args = list(big.mark = ".", decimal.mark = ",")
  )
CLASSI_FIN | freq |
---|---|
6 | 3 |
Dengue | 6.888.189 |
Dengue clássico | 2.346.272 |
Dengue com complicações | 18.969 |
Dengue com sinais de alarme | 101.062 |
Dengue grave | 8.931 |
Descartado | 5.364.096 |
Febre hemorrágica do dengue | 5.307 |
Inconclusivo | 2.151.527 |
Síndrome do choque do dengue | 297 |
NA | 71.348 |
Three records carry an invalid label (6) and 71.348 are missing. The missing values may have two causes: (1) the notification is still under evaluation or (2) the value is genuinely missing.
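The distinction between valid, invalid, and missing labels can be made explicit with a lookup against the expected values. The sketch below assumes that the labels shown in the table above, other than 6, form the complete set of valid classifications; that set (`expected_labels`), the toy data, and the `CLASSI_FIN_status` column are assumptions made for illustration.

```r
library(dplyr)

# Assumed set of valid labels, copied from the frequency table above
expected_labels <- c(
  "Dengue", "Dengue clássico", "Dengue com complicações",
  "Dengue com sinais de alarme", "Dengue grave", "Descartado",
  "Febre hemorrágica do dengue", "Inconclusivo",
  "Síndrome do choque do dengue"
)

# Toy data: hypothetical records, for illustration only
toy <- tibble(CLASSI_FIN = c("Dengue", "6", NA, "Descartado", "6"))

# Classify each record, then count records per status
label_check <- toy %>%
  mutate(
    CLASSI_FIN_status = case_when(
      is.na(CLASSI_FIN)               ~ "missing",
      CLASSI_FIN %in% expected_labels ~ "valid",
      TRUE                            ~ "invalid"
    )
  ) %>%
  count(CLASSI_FIN_status, name = "freq")

label_check
```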
Date of first symptom onset: DT_SIN_PRI
This is the date closest to the infection date and therefore the most relevant for epidemiological analysis.
# Dates outside this interval are treated as invalid
valid_interval <- interval(ymd("2001-01-01"), ymd("2022-12-31"))

dengue %>%
  mutate(
    DT_SIN_PRI_check = ymd(DT_SIN_PRI) %within% valid_interval
  ) %>%
  group_by(DT_SIN_PRI_check) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  kable(
    format.args = list(big.mark = ".", decimal.mark = ",")
  )
DT_SIN_PRI_check | freq |
---|---|
FALSE | 25.826 |
TRUE | 16.929.951 |
NA | 224 |
This variable is missing for 224 records. For 25.826 records, the dates are invalid (outside the period 2001-2022). It is possible to impute these values from DT_NOTIFIC (date of notification), if that date is valid.
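A possible form of this imputation is sketched below, assuming that both date columns can be parsed with ymd() and that the notification date is an acceptable stand-in for the onset date. The toy data and the columns `sin_pri`, `notific`, `sin_pri_ok`, `notific_ok`, and `DT_SIN_PRI_imp` are hypothetical names introduced for illustration.

```r
library(dplyr)
library(lubridate)

valid_interval <- interval(ymd("2001-01-01"), ymd("2022-12-31"))

# Toy data: hypothetical dates, for illustration only
toy <- tibble(
  DT_SIN_PRI = c("2015-02-10", "1915-02-10", NA, NA),
  DT_NOTIFIC = c("2015-02-12", "2015-02-14", "2020-01-05", "2031-01-01")
)

toy_imputed <- toy %>%
  mutate(
    sin_pri = ymd(DT_SIN_PRI),
    notific = ymd(DT_NOTIFIC),
    # A date is usable when it parses and falls inside the valid interval
    sin_pri_ok = !is.na(sin_pri) & sin_pri %within% valid_interval,
    notific_ok = !is.na(notific) & notific %within% valid_interval,
    # Keep a valid onset date; otherwise use a valid notification date
    DT_SIN_PRI_imp = case_when(
      sin_pri_ok ~ sin_pri,
      notific_ok ~ notific,
      TRUE       ~ as.Date(NA)
    )
  )

toy_imputed
```

In the toy data, the second and third records receive the notification date, while the fourth stays missing because its notification date is itself outside the valid interval.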
Session info
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.45 naniar_1.0.0 arrow_14.0.0.2 lubridate_1.9.3
[5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.4.4
[13] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] bit_4.0.5 gtable_0.3.4 jsonlite_1.8.8 compiler_4.3.2
[5] visdat_0.6.0 tidyselect_1.2.0 assertthat_0.2.1 scales_1.3.0
[9] yaml_2.3.8 fastmap_1.1.1 R6_2.5.1 generics_0.1.3
[13] htmlwidgets_1.6.4 munsell_0.5.0 pillar_1.9.0 tzdb_0.4.0
[17] rlang_1.1.3 utf8_1.2.4 stringi_1.8.3 xfun_0.42
[21] bit64_4.0.5 timechange_0.3.0 cli_3.6.2 withr_3.0.0
[25] magrittr_2.0.3 digest_0.6.34 grid_4.3.2 hms_1.1.3
[29] lifecycle_1.0.4 vctrs_0.6.5 evaluate_0.23 glue_1.7.0
[33] fansi_1.0.6 colorspace_2.1-0 rmarkdown_2.25 tools_4.3.2
[37] pkgconfig_2.0.3 htmltools_0.5.7