| Title: | Exploratory Analysis of Relationships Between Variables |
|---|---|
| Description: | Provides tools to explore and summarize relationship patterns between variables across one or multiple datasets. The package relies on efficient sampling strategies to estimate pairwise associations and supports quick exploratory data analysis for large or heterogeneous data sources. |
| Authors: | Braylin Alexander Jiménez Reynoso [aut, cre] |
| Maintainer: | Braylin Alexander Jiménez Reynoso <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.0.1 |
| Built: | 2026-05-11 08:00:25 UTC |
| Source: | https://github.com/cran/joinless |

joinless() compares selected variables from two data frames and infers their relational structure (e.g., one-to-one, many-to-one). It uses random sampling, with sample sizes either computed automatically or set by the user, to estimate match behavior, uniqueness patterns, and missingness characteristics. The goal is to help diagnose potential join keys or detect unrelated fields without performing full-table comparisons.

    joinless(
      x, y,
      x_vars = NULL, y_vars = NULL,
      conf = 0.95, error = 0.05,
      n_x = NULL, n_y = NULL,
      max_vars = 20,
      ignore = character(0),
      missingness_tol = 0.1,
      type_coerce = TRUE,
      seed = NULL,
      verbose = FALSE,
      info = FALSE
    )


| Argument | Description |
|---|---|
| x, y | Data frames. Input datasets to be compared. |
| x_vars, y_vars | Character vectors specifying the column names to compare (default: NULL). |
| conf | Numeric. Confidence level used to compute automatic sample sizes (default: 0.95). |
| error | Numeric. Margin of error used in sample size calculation (default: 0.05). |
| n_x, n_y | Optional fixed sample sizes for x and y (default: NULL). |
| max_vars | Integer. Maximum number of variables to compare per dataset. Defaults to 20. |
| ignore | Character vector of relation types to exclude from the output. By default, no types are excluded. |
| missingness_tol | Numeric. Maximum tolerated proportion of missing/problematic values within a variable (default: 0.1). |
| type_coerce | Logical. Controls type coercion between compared variables; when coercion is disabled, pairs with incompatible types are reported as "error_type" (default: TRUE). |
| seed | Optional integer. Random seed to make the sampling reproducible. |
| verbose | Logical (default: FALSE). |
| info | Logical. If TRUE, the output includes additional diagnostic columns (default: FALSE). |

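
The conf and error arguments are described as driving the automatic sample size, but the formula itself is not stated here. As a rough illustration only, the sketch below uses the classic sample size formula for estimating a proportion (worst case p = 0.5) with a finite population correction. The helper name auto_sample_size and the choice of formula are assumptions, not the package's confirmed implementation.

```r
# Hypothetical helper: Cochran-style sample size for estimating a proportion.
# The formula is an assumption for illustration, not taken from the package.
auto_sample_size <- function(n_rows, conf = 0.95, error = 0.05, p = 0.5) {
  z  <- qnorm(1 - (1 - conf) / 2)        # two-sided critical value
  n0 <- z^2 * p * (1 - p) / error^2      # size for an infinite population
  n  <- n0 / (1 + (n0 - 1) / n_rows)     # finite population correction
  min(n_rows, ceiling(n))                # never sample more rows than exist
}

auto_sample_size(1000)                            # default conf and error
auto_sample_size(1000, conf = 0.99, error = 0.02) # stricter settings, larger sample

# Reproducible row sample, mirroring the role of the seed argument
set.seed(42)
idx <- sample.int(1000, size = auto_sample_size(1000))
```

Under these assumptions, tightening error or raising conf increases the sample size, which is the trade-off the two arguments expose. Fixed values passed through n_x and n_y would bypass a calculation of this kind.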

Relationship inference is based on:

- Match rate: proportion of keys in x found in y
- Key uniqueness: frequency distribution of non-missing values

Based on these, relationships are classified as:

- "one-one"
- "many-one"
- "one-many"
- "many-many"
- "unrelated" (very low or zero match rate)
- "null" (missingness above tolerance)
- "error_type" (incompatible types and coercion disabled)


A data frame summarizing the inferred relationship between every variable pair.

If info = FALSE, the output contains:

- x_var: variable name in x
- y_var: variable name in y
- relation_type: inferred relationship

If info = TRUE, additional columns include:

- n_used: sample size used
- match_rate: proportion of sampled values from x found in y
- null_rate_x, null_rate_y: missingness/problematic rates
- type_x, type_y: underlying storage types
- notes: diagnostic messages

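
To make the classification rules and the diagnostic columns more concrete, here is a minimal base R sketch for a single pair of key vectors. It is not the package's code: the function name classify_relation, the min_match threshold for "unrelated", the order of the checks, the coercion to character, and the reading of "many-one" as "duplicates in x, unique in y" are all assumptions made for illustration. It also works on full vectors rather than samples.

```r
# Hypothetical classifier for one (x, y) key pair; thresholds are assumptions.
classify_relation <- function(x_key, y_key, missingness_tol = 0.1,
                              type_coerce = TRUE, min_match = 0.05) {
  make_row <- function(type, match_rate = NA_real_) {
    data.frame(relation_type = type, match_rate = match_rate,
               null_rate_x = mean(is.na(x_key)),
               null_rate_y = mean(is.na(y_key)),
               stringsAsFactors = FALSE)
  }
  # Incompatible types: give up unless coercion is allowed
  if (!identical(class(x_key), class(y_key))) {
    if (!type_coerce) return(make_row("error_type"))
    x_key <- as.character(x_key)
    y_key <- as.character(y_key)
  }
  # Missingness above tolerance on either side
  if (mean(is.na(x_key)) > missingness_tol ||
      mean(is.na(y_key)) > missingness_tol) {
    return(make_row("null"))
  }
  x_ok <- x_key[!is.na(x_key)]
  y_ok <- y_key[!is.na(y_key)]
  match_rate <- mean(x_ok %in% y_ok)      # proportion of keys in x found in y
  if (match_rate <= min_match) return(make_row("unrelated", match_rate))

  x_unique <- anyDuplicated(x_ok) == 0    # key uniqueness on each side
  y_unique <- anyDuplicated(y_ok) == 0
  type <- if (x_unique && y_unique) {
    "one-one"
  } else if (!x_unique && y_unique) {
    "many-one"
  } else if (x_unique && !y_unique) {
    "one-many"
  } else {
    "many-many"
  }
  make_row(type, match_rate)
}

classify_relation(1:5, 3:7)                  # unique keys on both sides
classify_relation(c(1, 1, 2, 2, 3), c(1, 2, 3)) # duplicates in x only
```

The returned columns are named after relation_type, match_rate, null_rate_x, and null_rate_y from the output description above so the two can be compared side by side.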

    df1 <- data.frame(id = 1:5, value = 1:5)
    df2 <- data.frame(id = 3:7, value = 3:7)

    joinless(df1, df2, x_vars = "id", y_vars = "id")
    joinless(df1, df2, conf = 0.99, error = 0.02, info = TRUE)
    joinless(df1, df2, ignore = "unrelated")

joinless_multiple() is a convenience wrapper around joinless() that compares a single dataset x against multiple datasets supplied in a list ys. Internally it calls joinless() once per dataset and row-binds the results, adding an extra column that identifies the target dataset.

    joinless_multiple(
      x, ys,
      x_vars = NULL, y_vars = NULL,
      dataset_names = NULL,
      ...
    )


| Argument | Description |
|---|---|
| x | A data frame. The reference dataset to compare from. |
| ys | A named or unnamed list of data frames. Each element is treated as a separate target dataset to compare against x. |
| x_vars | Optional character vector of column names in x to compare. |
| y_vars | Optional character vector of column names to use in each target dataset. If not NULL, it is intersected with the column names of each dataset in ys. |
| dataset_names | Optional character vector with labels for each dataset in ys. |
| ... | Additional arguments passed on to joinless(). |


For each dataset in ys, the function:

- optionally restricts the variables in x via x_vars,
- optionally restricts the variables in that dataset via y_vars,
- calls joinless() with the provided settings,
- tags the output with a dataset name.

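
The per-dataset loop described above can be sketched in base R without the package. In the example below, compare_stub is a hypothetical stand-in for joinless() that only reports an exact match rate, and compare_many is a hypothetical wrapper; the fallback labels for unnamed lists (positions "1", "2", ...) are also an assumption. The y_vars restriction is left out to keep the sketch short.

```r
# Hypothetical stand-in for joinless(): exact match rate per variable pair.
compare_stub <- function(x, y) {
  pairs <- expand.grid(x_var = names(x), y_var = names(y),
                       stringsAsFactors = FALSE)
  pairs$match_rate <- mapply(function(a, b) mean(x[[a]] %in% y[[b]]),
                             pairs$x_var, pairs$y_var, USE.NAMES = FALSE)
  pairs
}

# Hypothetical wrapper: one call per target dataset, then row-bind and tag.
compare_many <- function(x, ys, x_vars = NULL, dataset_names = NULL,
                         compare_fun = compare_stub) {
  if (is.null(dataset_names)) {
    dataset_names <- names(ys)
    if (is.null(dataset_names)) dataset_names <- as.character(seq_along(ys))
  }
  if (!is.null(x_vars)) x <- x[, x_vars, drop = FALSE]  # restrict x
  results <- lapply(seq_along(ys), function(i) {
    out <- compare_fun(x, ys[[i]])
    out$dataset <- dataset_names[i]   # tag rows with the target dataset
    out
  })
  do.call(rbind, results)
}

df_base <- data.frame(id = 1:5, value = 1:5)
res <- compare_many(
  df_base,
  list(a = data.frame(id = 3:7), b = data.frame(id = 1:5)),
  x_vars = "id"
)
res
```

The result has one row per variable pair and target dataset, with the dataset column distinguishing the sources.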

When y_vars is not NULL, the function intersects y_vars with the column names of each dataset in ys. This means that:

- Variables listed in y_vars but missing in a given dataset are silently dropped for that dataset.
- If none of the variables in y_vars exist in a particular dataset, that dataset is skipped and a warning is emitted.

This behavior avoids producing "error_type" rows solely due to missing columns in some of the target datasets. If all datasets are skipped (e.g., because none contain the requested y_vars), the function returns an empty data frame.

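
The intersection rule for y_vars can be illustrated with a few lines of base R. The helper restrict_targets below is hypothetical (it is not part of the package) and the warning text is invented; it only shows how intersect() drops absent columns and how a dataset with no requested columns could be skipped.

```r
# Hypothetical helper illustrating the y_vars intersection rule.
restrict_targets <- function(ys, y_vars = NULL) {
  if (is.null(y_vars)) return(ys)
  kept <- list()
  for (i in seq_along(ys)) {
    nm <- names(ys)[i]
    label <- if (is.null(nm) || !nzchar(nm)) as.character(i) else nm
    present <- intersect(y_vars, names(ys[[i]]))  # absent columns drop out
    if (length(present) == 0) {
      warning("Skipping dataset '", label, "': none of y_vars found")
      next
    }
    kept[[label]] <- ys[[i]][, present, drop = FALSE]
  }
  kept
}

df_a <- data.frame(id = 3:7, value = 3:7)
df_b <- data.frame(id_alt = 1:5, value = 11:15)
df_c <- data.frame(code = letters[1:3])

# df_c has neither "id" nor "id_alt", so it is skipped with a warning
kept <- restrict_targets(list(a = df_a, b = df_b, c = df_c),
                         y_vars = c("id", "id_alt"))
lapply(kept, names)
```

Here dataset a keeps only id and dataset b keeps only id_alt, which matches the "silently dropped" behavior described above.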

A data frame that row-binds the results of joinless() for all target datasets that were processed. It contains all columns returned by joinless() plus an additional column:

- dataset: identifier of the target dataset (one per element of ys).

If all datasets are skipped, an empty data frame is returned.


    df_base <- data.frame(id = 1:5, value = 1:5)
    df_a <- data.frame(id = 3:7, value = 3:7)
    df_b <- data.frame(id_alt = 1:5, value = 11:15)

    # Compare the same key from df_base against multiple datasets
    res <- joinless_multiple(
      x = df_base,
      ys = list(a = df_a, b = df_b),
      x_vars = "id",
      y_vars = c("id", "id_alt"),
      info = TRUE
    )