`files`

File pattern normalization utilities for S3 and MFT path analysis.

`FileDetection` `dataclass`

A custom detection rule for :func:detect_file_pattern.

Attributes:

Name	Type	Description
`pattern`	`str`	Regex applied to the working string (after prior substitutions).
`replacement`	`str`	Replacement string, e.g. `"{COST_CENTER}"` or `"<SAP>"`.
`substring`	`str \| None`	If given, the rule is skipped when this literal is absent from the original basename — cheap fast-path for context-specific rules.

Source code in src/pinky_core/files.py

@dataclass
class FileDetection:
    """A custom detection rule for :func:`detect_file_pattern`.

    Attributes:
        pattern: Regex applied to the working string (after prior substitutions).
        replacement: Replacement string, e.g. ``"{COST_CENTER}"`` or ``"<SAP>"``.
        substring: If given, the rule is skipped when this literal is absent from
            the *original* basename — cheap fast-path for context-specific rules.
    """

    pattern: str
    replacement: str
    substring: str | None = None

`detect_file_pattern(filepath, custom_detections=None)`

Normalize a filename to a canonical pattern by replacing variable segments.

Replaces dates, timestamps, UUIDs, ISO country codes, structured business identifiers (SIRET, SIREN, IBAN, BIC, VAT), and generic numeric IDs with named placeholders — enabling grouping of structurally similar filenames regardless of the specific value they carry.

Detection is purely structural: no business-context hint is required. Substitutions run most-specific first to avoid double-replacement.

Placeholder conventions:

{...} — temporal or generic numeric variable (date, timestamp, UUID, ID)
<...> — typed structured identifier (country code, SIRET, IBAN…)

Substitution order:

Timestamps (most digits, most specific)
UUIDs
Dates (delimited, then bare YYYYMMDD)
Year-month combinations
ISO country codes (alpha-3 before alpha-2)
Structured business identifiers (IBAN, SIRET, SIREN, VAT_FR)
Generic numeric segments (least specific)

Note: BIC/SWIFT codes (e.g. BNPAFRPP) are structurally indistinguishable from 8-letter uppercase words and cannot be detected reliably without a bank directory. BIC detection is intentionally excluded.

Parameters:

Name	Type	Description	Default
`filepath`	`str`	Full file path or bare filename to normalize. The basename (last `/`-separated segment) is extracted automatically before substitution, so both `"EMPLOYEES_20240115.csv"` and `"s3://bucket/feed/EMPLOYEES_20240115.csv"` produce the same output.	required
`custom_detections`	`list[FileDetection] \| None`	Optional list of :class:`FileDetection` rules applied before built-in substitutions. Each rule's `substring` filter is tested against the full `filepath` (not just the basename) so that directory-level context (e.g. `"workday"` in the S3 prefix) can gate project-specific patterns.	`None`

Returns:

Type	Description
`str`	Pattern string for the basename, with variable segments replaced by
`str`	placeholders.

Examples:

>>> detect_file_pattern("EMPLOYEES_20240115.csv")
'EMPLOYEES_{DATE}.csv'
>>> detect_file_pattern("s3://bucket/feed/EMPLOYEES_20240115.csv")
'EMPLOYEES_{DATE}.csv'
>>> detect_file_pattern("PAYROLL_55208131700027.csv")
'PAYROLL_{SIRET}.csv'
>>> detect_file_pattern("TRANSFER_FR7630004000031234567890143.csv")
'TRANSFER_{IBAN}.csv'

Source code in src/pinky_core/files.py

def detect_file_pattern(
    filepath: str,
    custom_detections: list[FileDetection] | None = None,
) -> str:
    """Normalize a filename to a canonical pattern by replacing variable segments.

    Replaces dates, timestamps, UUIDs, ISO country codes, structured business
    identifiers (SIRET, SIREN, IBAN, BIC, VAT), and generic numeric IDs with
    named placeholders — enabling grouping of structurally similar filenames
    regardless of the specific value they carry.

    Detection is purely structural: no business-context hint is required.
    Substitutions run most-specific first to avoid double-replacement.

    Placeholder conventions:

    - ``{...}`` — temporal or generic numeric variable (date, timestamp, UUID, ID)
    - ``<...>`` — typed structured identifier (country code, SIRET, IBAN…)

    Substitution order:

    1. Timestamps (most digits, most specific)
    2. UUIDs
    3. Dates (delimited, then bare YYYYMMDD)
    4. Year-month combinations
    5. ISO country codes (alpha-3 before alpha-2)
    6. Structured business identifiers (IBAN, SIRET, SIREN, VAT_FR)
    7. Generic numeric segments (least specific)

    Note: BIC/SWIFT codes (e.g. ``BNPAFRPP``) are structurally indistinguishable
    from 8-letter uppercase words and cannot be detected reliably without a bank
    directory. BIC detection is intentionally excluded.

    Args:
        filepath: Full file path or bare filename to normalize.  The basename
            (last ``/``-separated segment) is extracted automatically before
            substitution, so both ``"EMPLOYEES_20240115.csv"`` and
            ``"s3://bucket/feed/EMPLOYEES_20240115.csv"`` produce the same output.
        custom_detections: Optional list of :class:`FileDetection` rules applied
            *before* built-in substitutions. Each rule's ``substring`` filter is
            tested against the **full** ``filepath`` (not just the basename) so
            that directory-level context (e.g. ``"workday"`` in the S3 prefix) can
            gate project-specific patterns.

    Returns:
        Pattern string for the basename, with variable segments replaced by
        placeholders.

    Examples:
        >>> detect_file_pattern("EMPLOYEES_20240115.csv")
        'EMPLOYEES_{DATE}.csv'
        >>> detect_file_pattern("s3://bucket/feed/EMPLOYEES_20240115.csv")
        'EMPLOYEES_{DATE}.csv'
        >>> detect_file_pattern("PAYROLL_55208131700027.csv")
        'PAYROLL_{SIRET}.csv'
        >>> detect_file_pattern("TRANSFER_FR7630004000031234567890143.csv")
        'TRANSFER_{IBAN}.csv'
    """
    basename = os.path.basename(filepath) if "/" in filepath else filepath
    r = basename

    # 0. Caller-supplied rules — run before all built-ins
    # substring is tested against the original filepath so directory context is available
    if custom_detections:
        for rule in custom_detections:
            if rule.substring is None or rule.substring in filepath:
                r = re.sub(rule.pattern, rule.replacement, r)

    # 1. Timestamps — anchored to avoid matching inside longer digit sequences
    r = re.sub(r"(?<![0-9])[0-9]{8}[-._/:][0-9]{6}(?![0-9])", "{TIMESTAMP}", r)
    r = re.sub(r"(?<![0-9])(19|20)[0-9]{12}(?![0-9])", "{TIMESTAMP_YYYYMMDDHHMMSS}", r)
    r = re.sub(r"(?<![0-9])(19|20)[0-9]{10}(?![0-9])", "{TIMESTAMP_YYYYMMDDHHMM}", r)

    # 2. UUIDs (self-anchoring via hyphens)
    r = re.sub(
        r"[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}",
        "{UUID}",
        r,
    )

    # 3. Dates — delimited formats before bare YYYYMMDD, all anchored
    r = re.sub(r"(?<![0-9])[0-9]{4}[-._/][0-9]{2}[-._/][0-9]{2}(?![0-9])", "{DATE}", r)
    r = re.sub(r"(?<![0-9])[0-9]{2}[-._/][0-9]{2}[-._/][0-9]{4}(?![0-9])", "{DATE}", r)
    r = re.sub(r"(?<![0-9])(19|20)[0-9]{6}(?![0-9])", "{DATE}", r)

    # 4. Year-month — delimited before bare YYYYMM, all anchored
    r = re.sub(
        r"(?<![0-9])(19|20)[0-9]{2}[-._/](0[1-9]|1[0-2])(?![0-9])", "{YEAR_MONTH}", r
    )
    r = re.sub(
        r"(?<![0-9])(0[1-9]|1[0-2])[-._/](19|20)[0-9]{2}(?![0-9])", "{YEAR_MONTH}", r
    )
    r = re.sub(r"(?<![0-9])(19|20)[0-9]{2}(0[1-9]|1[0-2])(?![0-9])", "{YEAR_MONTH}", r)
    r = re.sub(r"(?<![0-9])(0[1-9]|1[0-2])(19|20)[0-9]{2}(?![0-9])", "{YEAR_MONTH}", r)

    # 5. ISO country codes — alpha-3 before alpha-2 (more specific first)
    r = re.sub(f"_({_ISO3})_", "_<ISO3_COUNTRY>_", r)
    r = re.sub(f"^({_ISO3})_", "<ISO3_COUNTRY>_", r)
    r = re.sub(f"_({_ISO2})_", "_<ISO2_COUNTRY>_", r)
    r = re.sub(f"^({_ISO2})_", "<ISO2_COUNTRY>_", r)

    # 6. Structured business identifiers — context-free, anchored by non-alphanumeric boundaries
    # IBAN: 2-letter country + 2-digit check + 11-30 alphanum (run before VAT_FR)
    r = re.sub(r"(?<![A-Z0-9])[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}(?![A-Z0-9])", "{IBAN}", r)
    # FR VAT number: FR + 2 alphanum + 9 digits (run after IBAN to avoid partial match)
    r = re.sub(r"(?<![A-Z0-9])FR[A-Z0-9]{2}[0-9]{9}(?![A-Z0-9])", "{VAT_FR}", r)
    # SIRET: exactly 14 digits — timestamps starting with 19/20 already consumed above
    r = re.sub(r"(?<![0-9])[0-9]{14}(?![0-9])", "{SIRET}", r)
    # SIREN: exactly 9 digits
    r = re.sub(r"(?<![0-9])[0-9]{9}(?![0-9])", "{SIREN}", r)

    # 7. Generic numeric segments (least specific, run last)
    r = re.sub(r"_([0-9]{6,})_", "_{ID}_", r)
    r = re.sub(r"_([0-9]+)_", "_{NUM}_", r)
    r = re.sub(r"_([0-9]+)\.", "_{NUM}.", r)
    r = re.sub(r"^([0-9]+)_", "{NUM}_", r)
    # Numbers immediately after a closing placeholder brace
    r = re.sub(r"\}([0-9]+)", "}{NUM}", r)
    # Trailing number before extension (not preceded by > or })
    r = re.sub(r"([^}>])([0-9]+)\.([a-z]+)$", r"\1{NUM}.\3", r)

    return r

files

FileDetection dataclass

detect_file_pattern(filepath, custom_detections=None)

`files`

`FileDetection` `dataclass`

`detect_file_pattern(filepath, custom_detections=None)`