Skip to content

files

File pattern normalization utilities for S3 and MFT path analysis.

FileDetection dataclass

A custom detection rule for :func:detect_file_pattern.

Attributes:

Name Type Description
pattern str

Regex applied to the working string (after prior substitutions).

replacement str

Replacement string, e.g. "{COST_CENTER}" or "<SAP>".

substring str | None

If given, the rule is skipped when this literal is absent from the original basename — cheap fast-path for context-specific rules.

Source code in src/pinky_core/files.py
41
42
43
44
45
46
47
48
49
50
51
52
53
54
@dataclass
class FileDetection:
    """A custom detection rule for :func:`detect_file_pattern`.

    Attributes:
        pattern: Regex applied to the working string (after prior substitutions).
        replacement: Replacement string, e.g. ``"{COST_CENTER}"`` or ``"<SAP>"``.
        substring: If given, the rule is skipped when this literal is absent from
            the *original* basename — cheap fast-path for context-specific rules.
    """

    pattern: str
    replacement: str
    substring: str | None = None

detect_file_pattern(filepath, custom_detections=None)

Normalize a filename to a canonical pattern by replacing variable segments.

Replaces dates, timestamps, UUIDs, ISO country codes, structured business identifiers (SIRET, SIREN, IBAN, BIC, VAT), and generic numeric IDs with named placeholders — enabling grouping of structurally similar filenames regardless of the specific value they carry.

Detection is purely structural: no business-context hint is required. Substitutions run most-specific first to avoid double-replacement.

Placeholder conventions:

  • {...} — temporal or generic numeric variable (date, timestamp, UUID, ID)
  • <...> — typed structured identifier (country code, SIRET, IBAN…)

Substitution order:

  1. Timestamps (most digits, most specific)
  2. UUIDs
  3. Dates (delimited, then bare YYYYMMDD)
  4. Year-month combinations
  5. ISO country codes (alpha-3 before alpha-2)
  6. Structured business identifiers (IBAN, SIRET, SIREN, VAT_FR)
  7. Generic numeric segments (least specific)

Note: BIC/SWIFT codes (e.g. BNPAFRPP) are structurally indistinguishable from 8-letter uppercase words and cannot be detected reliably without a bank directory. BIC detection is intentionally excluded.

Parameters:

Name Type Description Default
filepath str

Full file path or bare filename to normalize. The basename (last /-separated segment) is extracted automatically before substitution, so both "EMPLOYEES_20240115.csv" and "s3://bucket/feed/EMPLOYEES_20240115.csv" produce the same output.

required
custom_detections list[FileDetection] | None

Optional list of :class:FileDetection rules applied before built-in substitutions. Each rule's substring filter is tested against the full filepath (not just the basename) so that directory-level context (e.g. "workday" in the S3 prefix) can gate project-specific patterns.

None

Returns:

Type Description
str

Pattern string for the basename, with variable segments replaced by

str

placeholders.

Examples:

>>> detect_file_pattern("EMPLOYEES_20240115.csv")
'EMPLOYEES_{DATE}.csv'
>>> detect_file_pattern("s3://bucket/feed/EMPLOYEES_20240115.csv")
'EMPLOYEES_{DATE}.csv'
>>> detect_file_pattern("PAYROLL_55208131700027.csv")
'PAYROLL_{SIRET}.csv'
>>> detect_file_pattern("TRANSFER_FR7630004000031234567890143.csv")
'TRANSFER_{IBAN}.csv'
Source code in src/pinky_core/files.py
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
def detect_file_pattern(
    filepath: str,
    custom_detections: list[FileDetection] | None = None,
) -> str:
    """Normalize a filename to a canonical pattern by replacing variable segments.

    Replaces dates, timestamps, UUIDs, ISO country codes, structured business
    identifiers (SIRET, SIREN, IBAN, BIC, VAT), and generic numeric IDs with
    named placeholders — enabling grouping of structurally similar filenames
    regardless of the specific value they carry.

    Detection is purely structural: no business-context hint is required.
    Substitutions run most-specific first to avoid double-replacement.

    Placeholder conventions:

    - ``{...}`` — temporal or generic numeric variable (date, timestamp, UUID, ID)
    - ``<...>`` — typed structured identifier (country code, SIRET, IBAN…)

    Substitution order:

    1. Timestamps (most digits, most specific)
    2. UUIDs
    3. Dates (delimited, then bare YYYYMMDD)
    4. Year-month combinations
    5. ISO country codes (alpha-3 before alpha-2)
    6. Structured business identifiers (IBAN, SIRET, SIREN, VAT_FR)
    7. Generic numeric segments (least specific)

    Note: BIC/SWIFT codes (e.g. ``BNPAFRPP``) are structurally indistinguishable
    from 8-letter uppercase words and cannot be detected reliably without a bank
    directory. BIC detection is intentionally excluded.

    Args:
        filepath: Full file path or bare filename to normalize.  The basename
            (last ``/``-separated segment) is extracted automatically before
            substitution, so both ``"EMPLOYEES_20240115.csv"`` and
            ``"s3://bucket/feed/EMPLOYEES_20240115.csv"`` produce the same output.
        custom_detections: Optional list of :class:`FileDetection` rules applied
            *before* built-in substitutions. Each rule's ``substring`` filter is
            tested against the **full** ``filepath`` (not just the basename) so
            that directory-level context (e.g. ``"workday"`` in the S3 prefix) can
            gate project-specific patterns.

    Returns:
        Pattern string for the basename, with variable segments replaced by
        placeholders.

    Examples:
        >>> detect_file_pattern("EMPLOYEES_20240115.csv")
        'EMPLOYEES_{DATE}.csv'
        >>> detect_file_pattern("s3://bucket/feed/EMPLOYEES_20240115.csv")
        'EMPLOYEES_{DATE}.csv'
        >>> detect_file_pattern("PAYROLL_55208131700027.csv")
        'PAYROLL_{SIRET}.csv'
        >>> detect_file_pattern("TRANSFER_FR7630004000031234567890143.csv")
        'TRANSFER_{IBAN}.csv'
    """
    basename = os.path.basename(filepath) if "/" in filepath else filepath
    r = basename

    # 0. Caller-supplied rules — run before all built-ins
    # substring is tested against the original filepath so directory context is available
    if custom_detections:
        for rule in custom_detections:
            if rule.substring is None or rule.substring in filepath:
                r = re.sub(rule.pattern, rule.replacement, r)

    # 1. Timestamps — anchored to avoid matching inside longer digit sequences
    r = re.sub(r"(?<![0-9])[0-9]{8}[-._/:][0-9]{6}(?![0-9])", "{TIMESTAMP}", r)
    r = re.sub(r"(?<![0-9])(19|20)[0-9]{12}(?![0-9])", "{TIMESTAMP_YYYYMMDDHHMMSS}", r)
    r = re.sub(r"(?<![0-9])(19|20)[0-9]{10}(?![0-9])", "{TIMESTAMP_YYYYMMDDHHMM}", r)

    # 2. UUIDs (self-anchoring via hyphens)
    r = re.sub(
        r"[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}",
        "{UUID}",
        r,
    )

    # 3. Dates — delimited formats before bare YYYYMMDD, all anchored
    r = re.sub(r"(?<![0-9])[0-9]{4}[-._/][0-9]{2}[-._/][0-9]{2}(?![0-9])", "{DATE}", r)
    r = re.sub(r"(?<![0-9])[0-9]{2}[-._/][0-9]{2}[-._/][0-9]{4}(?![0-9])", "{DATE}", r)
    r = re.sub(r"(?<![0-9])(19|20)[0-9]{6}(?![0-9])", "{DATE}", r)

    # 4. Year-month — delimited before bare YYYYMM, all anchored
    r = re.sub(
        r"(?<![0-9])(19|20)[0-9]{2}[-._/](0[1-9]|1[0-2])(?![0-9])", "{YEAR_MONTH}", r
    )
    r = re.sub(
        r"(?<![0-9])(0[1-9]|1[0-2])[-._/](19|20)[0-9]{2}(?![0-9])", "{YEAR_MONTH}", r
    )
    r = re.sub(r"(?<![0-9])(19|20)[0-9]{2}(0[1-9]|1[0-2])(?![0-9])", "{YEAR_MONTH}", r)
    r = re.sub(r"(?<![0-9])(0[1-9]|1[0-2])(19|20)[0-9]{2}(?![0-9])", "{YEAR_MONTH}", r)

    # 5. ISO country codes — alpha-3 before alpha-2 (more specific first)
    r = re.sub(f"_({_ISO3})_", "_<ISO3_COUNTRY>_", r)
    r = re.sub(f"^({_ISO3})_", "<ISO3_COUNTRY>_", r)
    r = re.sub(f"_({_ISO2})_", "_<ISO2_COUNTRY>_", r)
    r = re.sub(f"^({_ISO2})_", "<ISO2_COUNTRY>_", r)

    # 6. Structured business identifiers — context-free, anchored by non-alphanumeric boundaries
    # IBAN: 2-letter country + 2-digit check + 11-30 alphanum (run before VAT_FR)
    r = re.sub(r"(?<![A-Z0-9])[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}(?![A-Z0-9])", "{IBAN}", r)
    # FR VAT number: FR + 2 alphanum + 9 digits (run after IBAN to avoid partial match)
    r = re.sub(r"(?<![A-Z0-9])FR[A-Z0-9]{2}[0-9]{9}(?![A-Z0-9])", "{VAT_FR}", r)
    # SIRET: exactly 14 digits — timestamps starting with 19/20 already consumed above
    r = re.sub(r"(?<![0-9])[0-9]{14}(?![0-9])", "{SIRET}", r)
    # SIREN: exactly 9 digits
    r = re.sub(r"(?<![0-9])[0-9]{9}(?![0-9])", "{SIREN}", r)

    # 7. Generic numeric segments (least specific, run last)
    r = re.sub(r"_([0-9]{6,})_", "_{ID}_", r)
    r = re.sub(r"_([0-9]+)_", "_{NUM}_", r)
    r = re.sub(r"_([0-9]+)\.", "_{NUM}.", r)
    r = re.sub(r"^([0-9]+)_", "{NUM}_", r)
    # Numbers immediately after a closing placeholder brace
    r = re.sub(r"\}([0-9]+)", "}{NUM}", r)
    # Trailing number before extension (not preceded by > or })
    r = re.sub(r"([^}>])([0-9]+)\.([a-z]+)$", r"\1{NUM}.\3", r)

    return r