Computer Vision · Apr 22, 2026 · 9 min read

Why GoTrack uses a FAISS reranker on top of CLIP.

Pickup detection isn’t a classifier; it’s a similarity problem. We index live frames against a per-store catalogue and let a reranker decide what the shopper actually grabbed.

GoTrack team

Computer-vision retail

We tried to ship GoTrack with a classifier. It didn’t work, not because the model was bad but because the problem isn’t a classification problem; it’s a similarity problem. Stores don’t have one rack of T-shirts; they have 80 SKUs sharing the same silhouette, and shoppers grab whatever they grab. A classifier needs a trained output for every SKU, and retraining whenever the catalogue changes. A retrieval pipeline needs you to embed the catalogue once.

From classifier to retrieval

We split the pipeline into two stages. Stage one runs CLIP on the live frame and the per-store catalogue, then queries FAISS for the top-K nearest catalogue entries. Stage two reranks those K with seven extra signals — pose, hand position, dwell time, prior pickups in the session, time-of-day weighting, stock state, and the rack’s zone tier.

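# Stage 1: vector recall
emb = clip.encode_image(frame)  # FAISS expects a float32 array of shape (1, d)
# `index` (assumed name) is this store's prebuilt FAISS index; search returns (distances, ids)
_, candidate_ids = index.search(emb, 20)
candidates = candidate_ids[0]  # top-20 catalogue ids for the single query frame

# Stage 2: rerank with side-features
ranked = reranker.rerank(
    candidates,
    pose=pose_emb,
    hand=hand_position,
    dwell=dwell_seconds,
    history=session_history,
    tod=time_of_day,
    stock=stock_state,
    tier=zone_tier,
)
pickup = ranked[0]  # best-scoring candidate after reranking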
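To make stage one concrete, here is a minimal, dependency-free sketch of what the recall step computes, assuming cosine similarity over normalised embeddings. It brute-forces the search that a FAISS index would accelerate. The 3-dimensional vectors and SKU names are invented for illustration and are not real CLIP output.

```python
import math

def normalise(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, catalogue, k):
    """Return up to k (sku, score) pairs, most similar first.

    Brute-force stand-in for an exact inner-product index search.
    """
    q = normalise(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalise(vec))), sku)
        for sku, vec in catalogue.items()
    ]
    scored.sort(reverse=True)
    return [(sku, score) for score, sku in scored[:k]]

# Hypothetical catalogue: SKU id -> embedding (real CLIP vectors have hundreds of dimensions).
catalogue = {
    "tee-white-s": [1.0, 0.0, 0.0],
    "tee-white-m": [0.7, 0.7, 0.0],
    "jeans-blue": [0.0, 0.0, 1.0],
}
frame_embedding = [0.9, 0.2, 0.1]
candidates = top_k(frame_embedding, catalogue, k=2)
```

Asking for more candidates than the catalogue holds simply returns the whole catalogue, ranked.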

Why two stages

Two stages let us trade speed and accuracy independently. CLIP + FAISS is fast and approximate; the reranker is slower but informed. We never run the reranker on more than 20 candidates. That’s how we hit a sub-300 ms pickup-to-signage swap on commodity edge hardware.
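One way to picture that independence: the cheap scorer touches every catalogue entry, while the expensive scorer only ever sees the shortlist. In the sketch below, `cheap_score` and `expensive_score` are hypothetical placeholders rather than GoTrack's models; the point is the call counts.

```python
MAX_RERANK = 20  # cap from the text: the reranker never sees more than 20 candidates

def two_stage(items, cheap_score, expensive_score, k=MAX_RERANK):
    """Recall with a cheap scorer, then rerank only the top-k with an expensive one."""
    k = min(k, MAX_RERANK)
    shortlist = sorted(items, key=cheap_score, reverse=True)[:k]
    return sorted(shortlist, key=expensive_score, reverse=True)

# Toy setup: 1,000 integer "SKUs"; count how often each scorer is called.
calls = {"cheap": 0, "expensive": 0}

def cheap_score(item):
    calls["cheap"] += 1
    return -abs(item - 500)   # rough guess: closer to 500 is better

def expensive_score(item):
    calls["expensive"] += 1
    return -abs(item - 497)   # "informed" score: the true answer is 497

ranked = two_stage(range(1000), cheap_score, expensive_score)
best = ranked[0]
```

Here the cheap scorer runs 1,000 times and the expensive one 20 times, yet the informed answer still wins because it survived the shortlist. Growing the catalogue raises only the cheap count.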

What the seven signals buy us

Pose disambiguates two visually similar SKUs when one has a different fold pattern.
Hand position kills false positives where a shopper walks past without grabbing.
Dwell time biases against quick browse fly-bys.
Session history boosts adjacency — if you just picked a small, you’re likely picking another small.
Time-of-day reweights against the rack’s historical pickup distribution.
Stock state demotes anything we’re out of (you can’t pick a phantom).
Zone tier gives a small lift to higher-margin or campaign-flagged items.
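How the signals are combined is not spelled out here, so the following is only a sketch under one plausible assumption: a weighted sum on top of the stage-one similarity, with hand position as a hard gate and stock state as a penalty. Every field name, weight, and threshold is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    sku: str
    similarity: float       # stage-1 cosine similarity
    pose_match: float       # 0..1, agreement between observed pose and the SKU's fold pattern
    hand_near: bool         # is the shopper's hand on the item?
    dwell_seconds: float
    seen_in_session: bool   # adjacent to an earlier pickup in this session
    tod_prior: float        # 0..1, historical pickup share for this time of day
    in_stock: bool
    zone_tier: int          # 0 = standard, higher = margin or campaign lift

# Hypothetical weights, chosen only to keep the example readable.
WEIGHTS = {"similarity": 1.0, "pose": 0.3, "dwell": 0.1,
           "history": 0.2, "tod": 0.2, "tier": 0.05}

def rerank_score(c):
    """Fold the seven side-signals and the stage-1 similarity into one score."""
    if not c.hand_near:
        return float("-inf")    # no hand on the item: treat as a walk-past
    score = (
        WEIGHTS["similarity"] * c.similarity
        + WEIGHTS["pose"] * c.pose_match
        + WEIGHTS["dwell"] * min(c.dwell_seconds, 3.0) / 3.0   # saturates after 3 s
        + WEIGHTS["history"] * c.seen_in_session
        + WEIGHTS["tod"] * c.tod_prior
        + WEIGHTS["tier"] * c.zone_tier
    )
    if not c.in_stock:
        score -= 1.0            # demote anything that cannot be on the rack
    return score

def rerank(candidates):
    """Return candidates sorted best first."""
    return sorted(candidates, key=rerank_score, reverse=True)

ranked = rerank([
    Candidate("tee-white-s", 0.92, 0.5, True, 2.0, False, 0.3, False, 0),
    Candidate("tee-white-m", 0.90, 0.5, True, 2.0, True, 0.3, True, 1),
    Candidate("tee-white-l", 0.95, 0.5, False, 2.0, False, 0.3, True, 0),
])
```

In this toy run the visually closest SKU loses: `tee-white-l` has no hand on it, `tee-white-s` is out of stock, so `tee-white-m` ranks first despite the lowest raw similarity.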

Per-store, not per-tenant

FAISS indexes are per-store, not per-tenant. Two stores in the same chain often run different SKU sets, different layouts, and different stock states. Building one index per store is cheap (CLIP embeddings are tiny), and it cuts our recall noise in half.
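A per-store layout can be sketched as a registry keyed by store id. The class below is an illustration, not GoTrack's code: it uses a plain dictionary and a brute-force dot product where a real deployment would hold one FAISS index per store, and it assumes embeddings are already unit-length.

```python
class StoreIndexRegistry:
    """Keeps one catalogue index per store id (not one per tenant or chain)."""

    def __init__(self):
        self._indexes = {}

    def build(self, store_id, catalogue):
        """(Re)build a store's index from its own SKU -> embedding mapping."""
        self._indexes[store_id] = dict(catalogue)

    def search(self, store_id, query, k):
        """Return the k SKUs in this store's catalogue closest to the query."""
        index = self._indexes[store_id]   # KeyError if the store was never indexed

        def dot(sku):
            return sum(a * b for a, b in zip(query, index[sku]))

        return sorted(index, key=dot, reverse=True)[:k]

# Two hypothetical stores in the same chain with different SKU sets.
registry = StoreIndexRegistry()
registry.build("store-a", {"tee-white-s": [1.0, 0.0], "tee-black-s": [0.0, 1.0]})
registry.build("store-b", {"tee-white-m": [1.0, 0.0], "hoodie-grey": [0.0, 1.0]})

query = [0.9, 0.1]
hit_a = registry.search("store-a", query, k=1)
hit_b = registry.search("store-b", query, k=1)
```

The same frame embedding resolves to a different SKU in each store, and neither search can return an item the other store stocks.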

What we measured

On the boutique pilot, the FAISS+reranker pipeline produced a 22% lift in adjacent-SKU pickups versus our v1 classifier baseline, and the false-positive rate (signage swaps with no pickup) dropped from 11% to under 2%. Sub-300 ms end-to-end stayed comfortable on a single overhead camera per rack.
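For readers who want to track the same metric, here is a minimal sketch of the false-positive rate as defined above: signage swaps with no pickup. The denominator (all swaps) is an assumption, and the event log is synthetic, not pilot data.

```python
def false_positive_rate(events):
    """Share of signage swaps that were not backed by a real pickup.

    `events` is a list of (swapped, picked_up) booleans.
    Denominator is all swaps, which is an assumption about the metric.
    """
    swaps = [picked_up for swapped, picked_up in events if swapped]
    if not swaps:
        return 0.0
    return sum(1 for picked_up in swaps if not picked_up) / len(swaps)

# Synthetic log: 49 correct swaps, 1 swap with no pickup, 10 frames with no swap.
events = [(True, True)] * 49 + [(True, False)] + [(False, False)] * 10
rate = false_positive_rate(events)
```

Frames where no swap fired are ignored, so the rate reflects only what shoppers actually saw change on the signage.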

“A pickup is a similarity to a vector you already indexed. Stop classifying; start retrieving.”
