Given a profile_data() profile (which carries cross-column coverage
information), suggests how many rows to synthesize so that the synthetic
data can still represent every category combination and every category
level observed in the original data, without blindly matching a large
original row count.
Usage
suggest_min_rows(
  profile,
  roles = NULL,
  data = NULL,
  k = 5L,
  threshold = 1000L,
  cap = 5000L
)
Arguments
- profile
A dataganger_profile from profile_data().
- roles
Optional; a dataganger_roles object. When provided together with data, the coverage computation is filtered to only the columns whose effective role is synthesizable (excludes ID candidates, free text, and user-excluded columns).
- data
Optional; the original data frame. When provided alongside roles, coverage is recomputed on the filtered column subset so that the suggestion reacts to role changes on the Configure page.
- k
Reserved for a future k-anonymity-style cell-size floor; unused by the current coverage rule.
- threshold
Row count at or above which a reduction is suggested.
- cap
Maximum suggested row count from combination coverage.
Value
A list with:
- n
Suggested integer row count.
- rationale
Human-readable explanation.
- original_n
Original row count.
- combination_count
Observed category-combination count (or NA).
- floor
Per-column distinct floor used (or NA).
- capped
TRUE if the cap bound the suggestion.
- reduced
TRUE if the suggestion is below the original count.
Details
The rule (coverage-based) is:
- For small inputs (fewer than threshold rows, default 1000) the original row count is kept, since there is nothing to gain from reducing.
- Otherwise the suggestion is the number of observed cross-column category combinations, capped at cap (default 5000) to avoid suggesting millions of rows on wide data, and floored at the largest per-column distinct count so every level remains representable. The suggestion never exceeds the original row count.
Continuous columns are covered by preserving their min/max (already handled by the synthesis engine); they do not raise the suggested count.
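The rule above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: sketch_min_rows() is a hypothetical helper, it works on a raw data frame rather than a profile, it treats factor and character columns as the categorical ones, and it applies the cap first, then the floor, then the original-row-count limit, which is one reading of the description.

```r
# Hypothetical helper illustrating the coverage rule; not part of the package.
sketch_min_rows <- function(df, threshold = 1000L, cap = 5000L) {
  original_n <- nrow(df)

  # Small inputs: keep the original row count.
  if (original_n < threshold) {
    return(list(n = original_n, combination_count = NA_integer_,
                floor = NA_integer_, capped = FALSE, reduced = FALSE))
  }

  # Assumption: factor and character columns are the categorical ones.
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  stopifnot(any(is_cat))  # the sketch does not cover all-continuous data
  cats <- df[, is_cat, drop = FALSE]

  # Observed cross-column category combinations = distinct categorical rows.
  combos <- nrow(unique(cats))
  # Largest per-column distinct count, so every level stays representable.
  floor_n <- max(vapply(cats, function(x) length(unique(x)), integer(1)))

  # Cap, then floor, then never exceed the original row count.
  n <- min(max(min(combos, cap), floor_n), original_n)

  list(n = as.integer(n), combination_count = combos, floor = floor_n,
       # Assumption: flagged whenever the cap truncated the combination count.
       capped = combos > cap,
       reduced = n < original_n)
}

# Toy data: 2000 rows, two categorical columns, one continuous column.
big <- data.frame(
  a = rep(c("x", "y"), 1000),
  b = rep(c("p", "q", "r", "s"), 500),
  v = seq_len(2000)
)
sketch_min_rows(big)$n  # 4 for this toy data
```

In the toy data only four (a, b) combinations occur and the widest column has four levels, so the sketch suggests 4 rows; the continuous column v does not raise the count.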
Examples
p <- profile_data(datasets::iris)
suggest_min_rows(p)
#> $n
#> [1] 150
#>
#> $rationale
#> [1] "Original is small (150 rows); synthesizing the same number."
#>
#> $original_n
#> [1] 150
#>
#> $combination_count
#> [1] NA
#>
#> $floor
#> [1] NA
#>
#> $capped
#> [1] FALSE
#>
#> $reduced
#> [1] FALSE
#>
