
Demo Dataset (Synthetic E-commerce Customers)

Overview

safefeat ships with a small synthetic dataset that mimics real e-commerce customer behaviour: sessions, browsing funnels, purchases, refunds, and support tickets.

This dataset is fully synthetic and contains no real customer information.

Loading the dataset

from safefeat.datasets import load_customer_demo

events, spine = load_customer_demo()

Dataset structure

customer_events.csv

An event log of customer activity.

Example row:

entity_id   event_time           session_id  event_type  amount  channel  device  product_category  payment_method
cust_00001  2023-02-22 01:07:37  s_0000022   visit       0.0     organic  web     books             NaN

Key columns

  • entity_id: Customer identifier.
  • event_time: Timestamp of the event.
  • session_id: Session identifier.
  • event_type: One of: visit, view, add_to_cart, purchase, refund, support_ticket.
  • amount: Net transaction amount. Positive for purchases, negative for refunds, 0.0 otherwise.
  • channel: Acquisition or interaction channel. May be missing.
  • device: One of: web, ios, android.
  • product_category: Product category associated with the event. May be missing.
  • payment_method: Present only for purchase events; missing (NaN) otherwise.
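The column conventions above can be checked mechanically. The following is a minimal sketch on a small hand-made frame, not on the shipped CSV; the helper name check_event_conventions and the sample rows are hypothetical and only illustrate the documented rules for amount and payment_method.

```python
import pandas as pd

# Hypothetical miniature event log that follows the documented conventions.
events = pd.DataFrame({
    "entity_id": ["cust_00001", "cust_00001", "cust_00002"],
    "event_time": pd.to_datetime(
        ["2023-02-22 01:07:37", "2023-02-22 01:15:02", "2023-03-01 09:30:00"]
    ),
    "event_type": ["visit", "purchase", "refund"],
    "amount": [0.0, 25.0, -25.0],
    "payment_method": [None, "card", None],
})


def check_event_conventions(df):
    """Return one boolean per documented convention (illustrative helper)."""
    is_purchase = df["event_type"] == "purchase"
    is_refund = df["event_type"] == "refund"
    is_other = ~(is_purchase | is_refund)
    return {
        # amount is positive for purchases, negative for refunds, 0.0 otherwise
        "purchases_positive": bool((df.loc[is_purchase, "amount"] > 0).all()),
        "refunds_negative": bool((df.loc[is_refund, "amount"] < 0).all()),
        "others_zero": bool((df.loc[is_other, "amount"] == 0).all()),
        # payment_method is present only for purchase events
        "payment_method_only_on_purchases": bool(
            df.loc[~is_purchase, "payment_method"].isna().all()
        ),
    }


checks = check_event_conventions(events)
print(checks)
```

The same helper could be pointed at the events frame returned by load_customer_demo(), assuming its columns match the names listed above.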

customer_spine.csv

The modelling spine defines what is predicted and when.

entity_id   cutoff_time  churned
cust_00001  2024-01-01   0

Columns

  • entity_id: Customer identifier.
  • cutoff_time: The prediction time. Features must be computed using only data at or before this timestamp.
  • churned: Binary label. See definition below.

Label definition

At a given cutoff_time, a customer is labelled as churned if they have had no events for more than 90 days before the cutoff, that is, if their most recent event is more than 90 days old.

This label is computed using only events with event_time <= cutoff_time.
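To make the rule concrete, here is a minimal pandas sketch of the label logic. It is not the code used to generate the dataset: the function churn_label is hypothetical, and treating a customer with no history at all as churned is an assumption the text does not settle.

```python
import pandas as pd


def churn_label(events, entity_id, cutoff_time, inactivity_days=90):
    """Illustrative churn label: 1 if the last event at or before the
    cutoff is more than `inactivity_days` old, else 0."""
    cutoff = pd.Timestamp(cutoff_time)
    # Only events with event_time <= cutoff_time may be used.
    history = events[
        (events["entity_id"] == entity_id) & (events["event_time"] <= cutoff)
    ]
    if history.empty:
        # Assumption: no observed history is treated as inactive.
        return 1
    gap = cutoff - history["event_time"].max()
    return int(gap > pd.Timedelta(days=inactivity_days))


# Hypothetical toy events; cust_b has activity after the cutoff that must be ignored.
toy_events = pd.DataFrame({
    "entity_id": ["cust_a", "cust_b", "cust_b"],
    "event_time": pd.to_datetime(["2023-12-15", "2023-08-01", "2024-02-01"]),
})

label_a = churn_label(toy_events, "cust_a", "2024-01-01")  # last event 17 days before cutoff
label_b = churn_label(toy_events, "cust_b", "2024-01-01")  # last visible event 153 days before cutoff
```

In this toy data cust_b is labelled churned even though they return in February 2024, because that event lies after the cutoff and is not visible to the label.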

Multiple cutoffs

The spine may contain multiple rows per customer, e.g. monthly cutoffs. Each row is a separate “snapshot”:

“As of this date, using only historical data, what features can we compute and what is the churn label?”

This mirrors production usage, where customers are scored repeatedly, for example weekly or monthly.
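As an illustration of the multi-cutoff layout, the sketch below builds a small spine with one row per customer per monthly cutoff using plain pandas. The customer IDs and date range are made up for the example, and this is not how the shipped customer_spine.csv was necessarily produced.

```python
import pandas as pd

# Hypothetical customers and monthly cutoffs (first day of each month).
customers = pd.DataFrame({"entity_id": ["cust_00001", "cust_00002"]})
cutoffs = pd.DataFrame(
    {"cutoff_time": pd.date_range("2023-10-01", "2024-01-01", freq="MS")}
)

# Cross join: every customer gets one snapshot row per cutoff.
# (how="cross" requires pandas 1.2 or later.)
demo_spine = customers.merge(cutoffs, how="cross")
print(demo_spine)
```

Each of the resulting rows is an independent snapshot, so features and labels are computed separately for every (entity_id, cutoff_time) pair.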

Point-in-time safety

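The event log includes events that occur after some cutoff times, so aggregating over the full log without filtering would leak future information into the features.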

When computing features with:

allowed_lag="0s"

safefeat ensures that only events at or before each cutoff_time contribute to feature values.
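The effect of this guarantee can be illustrated in plain pandas. The sketch below is not safefeat's implementation; it only shows, on hypothetical toy data, what "only events at or before each cutoff_time" means for a simple sum.

```python
import pandas as pd

# Hypothetical toy data: the last event happens after the cutoff.
toy_events = pd.DataFrame({
    "entity_id": ["cust_00001"] * 3,
    "event_time": pd.to_datetime(["2023-12-20", "2024-01-01", "2024-01-05"]),
    "amount": [10.0, 5.0, 99.0],
})
toy_spine = pd.DataFrame({
    "entity_id": ["cust_00001"],
    "cutoff_time": pd.to_datetime(["2024-01-01"]),
})

# Pair every spine row with that customer's events, then keep only
# events at or before the row's own cutoff.
joined = toy_spine.merge(toy_events, on="entity_id", how="left")
visible = joined[joined["event_time"] <= joined["cutoff_time"]]

# Point-in-time sum: the 99.0 event on 2024-01-05 is excluded.
pit_sum = visible.groupby(["entity_id", "cutoff_time"])["amount"].sum()
naive_sum = toy_events.groupby("entity_id")["amount"].sum()
```

Here the point-in-time sum is 15.0, while the naive sum over the whole log is 114.0 because it includes the future event. Note that in this simple sketch a spine row with no visible events would drop out of the result entirely.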

Example: computing features point-in-time

from safefeat.datasets import load_customer_demo
from safefeat import build_features, WindowAgg, RecencyBlock

events, spine = load_customer_demo()

spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D", "90D"],
        metrics={
            "*": ["count"],
            "amount": ["sum", "mean"],
        },
    ),
    RecencyBlock(table="events"),
]

X = build_features(
    spine=spine[["entity_id", "cutoff_time"]],
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    allowed_lag="0s",
)

X.head()