ZhiZhi Gewu

The Last Mile From Analysis (Answers) to a Data Mart (Assets)

KY John — Sun, 21 Jun 2026 08:16:16 GMT

An analyst’s notebook is not a pipeline — and the gap is wider than it looks

Most useful analysis dies in a notebook. Someone writes a query that answers a real question, the answer ships in a deck, and the query is never seen again. When the same question comes back a month later, someone rewrites it — slightly differently, with a slightly different number.

The fix is well understood: turn that query into a durable asset in a data mart, scheduled, validated, and reusable. The problem is that the distance between “a query that returns the right answer” and “a pipeline asset an organization can depend on” is much larger than it appears, and the people best placed to close it — the analysts who understand the business logic — are usually the least incentivized to do so.

I spent a stretch recently converting a piece of validated analysis into a proper mart table, with an AI coding assistant as a pair. It changed my view of where the real barriers are, what AI actually moves, and — importantly — what it doesn’t.

The technical barrier: a wall made of small cliffs

A working analytical query and a production asset are different kinds of object. Crossing between them means absorbing a stack of concerns that have nothing to do with the question you set out to answer:

One-shot becomes idempotent. A SELECT you run once becomes a write that must produce the same result every time it runs, overwrite cleanly on re-run, and target exactly one partition per run.
Implicit schema becomes an explicit contract. Column names, types, and order stop being incidental and become a contract that has to stay in sync across the SQL, a schema spec, and a DDL definition — with automated checks that fail the build if they drift.
“Run it now” becomes scheduling. You inherit rolling date macros, dev/prod parametrization, and a surprising number of ways to get the date arithmetic subtly wrong.
Standalone becomes dependent. The asset has upstreams, and the platform wants you to declare them — readiness signals, ordering, what waits on what.
“Trust me” becomes CI (Continuous Integration). Naming rules, schema-match checks, dependency checks, catalog generation — a gauntlet your change must pass before anyone will look at it.
One run becomes a backfill. Now you must reason about history: how far back, in what order, and whether re-running is safe.

None of these is hard in isolation. Each is a small cliff. Stacked together, they’re a wall — and every one is an opportunity to ship something subtly wrong.
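To make the first of those cliffs concrete, here is a minimal sketch of an idempotent, single-partition write. Everything in it is an assumption for illustration: the in-memory dict stands in for a mart table, and `build_rows` and `write_partition` are hypothetical names, not any platform's API.

```python
from datetime import date

# Toy "mart table": partition date -> rows. A real platform would use a
# warehouse table; this dict only illustrates the contract.
table: dict[date, list[dict]] = {}

def build_rows(source: list[dict], run_date: date) -> list[dict]:
    """Pure transform: the same inputs always produce the same rows."""
    rows = [
        {"entity_id": r["entity_id"], "amount": r["amount"]}
        for r in source
        if r["event_date"] == run_date
    ]
    return sorted(rows, key=lambda r: r["entity_id"])

def write_partition(source: list[dict], run_date: date) -> None:
    """Overwrite exactly one partition; a re-run replaces, never appends."""
    table[run_date] = build_rows(source, run_date)

source = [
    {"entity_id": 2, "amount": 5.0, "event_date": date(2026, 6, 1)},
    {"entity_id": 1, "amount": 9.5, "event_date": date(2026, 6, 1)},
    {"entity_id": 3, "amount": 1.0, "event_date": date(2026, 6, 2)},
]

write_partition(source, date(2026, 6, 1))
first = list(table[date(2026, 6, 1)])
write_partition(source, date(2026, 6, 1))  # re-run: same rows, no duplicates
```

The design choice worth noticing is that the write is keyed by the run date and replaces the whole partition, so running it twice is indistinguishable from running it once.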

The incentive barrier: the one nobody puts in the diagram

The technical barrier is the one people talk about. The incentive barrier is the one that actually keeps the asset from getting built.

Analysts are measured on answers and on speed, not on maintainable assets. Productionizing a query is slow, its payoff is deferred, and the credit largely accrues to whoever uses the asset later. The platform itself has a learning curve, and that cost is paid by the individual for a benefit that is mostly collective. The “correct” alternative, handing it to a data engineering team, has its own friction: a queue, a spec, and a handoff in which the very business context that made the analysis correct gets lost and the analyst is demoted to a ticket-filer chasing their own request.

So the rational analyst, most of the time, just keeps the query in the notebook. The asset never gets built. The knowledge stays ephemeral. This is not a failure of diligence; it’s a predictable response to the incentives. The barrier was never only “can they?” — it’s also “is it worth it to them?” And often, honestly, it isn’t.

Where AI moves the line

This is the part that surprised me. The assistant was most valuable not as a code generator but as a translator and navigator — it collapsed the part of the wall that is pure tax.

Boilerplate and compliance. Scaffolding the asset to the repo’s conventions, keeping the schema/DDL/spec in lockstep, generating the dependency wiring, and grinding the change through the CI gates until they were green. The “translation tax” from analysis idiom to engineering idiom dropped sharply.
Platform navigation. The tribal knowledge — how deploys work, which date-macro forms silently misbehave, what a readiness marker actually does — is normally extracted from an engineer’s calendar. Here it was a conversation.
Tight validation loops. Replicate the trusted number, diff, hypothesize the cause of a mismatch, fix, re-run. Fast.
Operational scaffolding. When a backfill ran, the assistant watched the outputs land and flagged — quickly — that something was off.

The effect on the incentive math is the real story. When the fixed cost of productionizing drops, more analyses clear the bar where benefit exceeds cost. Work that wasn’t worth a handoff becomes worth doing yourself. The analyst can own more of the last mile — for localized, individual or small-team needs — without a full throw-over-the-wall.

Where AI did not — and could not — carry the weight

Here’s the part I want to be honest about, because the temptation is to stop at the previous section.

Every consequential moment in the project was a matter of judgment and skepticism, not typing — and those were mine to supply.

A backfill “succeeded” and wrote a tidy set of partitions. They were all empty. Understanding why required knowing that the table was anchored on a single point in a synchronized billing cycle, so most calendar dates legitimately have no rows — the empties were correct, not a bug. The assistant helped me run that down quickly, but only after I distrusted a green checkmark and asked the question.

Two tables looked joinable on their partition date. They aren’t — the same entity is anchored on different events in each, so the only correct join key is the entity id, and joining on the date would have silently produced almost nothing. That’s a contract-level fact about the data model, not something to infer from syntax.
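A toy illustration of that join trap, under loudly stated assumptions: the two tables, their column names, and the dates are invented, and the "date join" is modeled as adding the partition date to the entity key. It is a sketch of the failure mode, not the actual tables.

```python
# Hypothetical tables: the same entities, each anchored on a different event,
# so their partition dates rarely coincide.
billing = [
    {"entity_id": "A", "partition_date": "2026-06-01", "billed": 120.0},
    {"entity_id": "B", "partition_date": "2026-06-01", "billed": 80.0},
]
activation = [
    {"entity_id": "A", "partition_date": "2026-05-17", "plan": "basic"},
    {"entity_id": "B", "partition_date": "2026-06-01", "plan": "plus"},
]

def inner_join(left, right, keys):
    """Naive inner join on a tuple of key columns."""
    index = {}
    for rrow in right:
        index.setdefault(tuple(rrow[k] for k in keys), []).append(rrow)
    return [
        {**rrow, **lrow}
        for lrow in left
        for rrow in index.get(tuple(lrow[k] for k in keys), [])
    ]

with_date = inner_join(billing, activation, ("entity_id", "partition_date"))
by_entity = inner_join(billing, activation, ("entity_id",))

print(len(with_date), len(by_entity))  # 1 2
```

Nothing errors in the date-keyed version; it simply returns fewer rows, which is why the mistake is easy to ship.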

An upstream turned out to be a view with no completion signal, which meant the “obvious” way to wire a dependency would have made the job wait forever — and the right response depended on understanding the difference between scheduled runs and backfills. And when I floated a redesign that would have made backfilling trivially easy, the right call was to not do it, because it would have traded away a clean one-row-per-entity contract and invited double-counting. That’s an architectural value judgment.

The pattern is consistent: AI accelerates generation and investigation; the human supplies the right questions, the skepticism, and the trade-off decisions about contracts, semantics, and scale. The moments that mattered were someone saying “this number looks wrong” or “but these segments don’t behave the same” — and then the tooling helped chase it down.

And note where the guardrails came from. The schema checks, the dependency rules, the review, the conventions the assistant dutifully satisfied — engineers built those. The AI operated within that frame; it did not invent it. Take the frame away and the same assistant will happily generate a swamp of plausible, subtly wrong, unmaintained assets.

Not replacement — redistribution

I don’t think any of this means an organization needs less data engineering. It means the boundary of what an analyst can responsibly own has moved.

The roughly 80% that is boilerplate and platform-navigation becomes self-serve. The 20% that is architecture, shared contracts, reliability, and scale stays with engineers — and with the analyst’s own judgment, now better-informed because the tooling made it cheap to investigate. The healthy version of this is fewer trivial tickets in the engineering queue, more analyses becoming durable assets, and engineers freed for the genuinely hard platform work that makes analyst self-serve safe in the first place. The better the guardrails, the more an analyst can be trusted to do alone — so investing in the platform matters more, not less.

The failure version is just as easy to reach: hand people a code generator without the guardrails or the skepticism, and you get a lake of confident, broken pipelines that nobody owns. The win is conditional.

So the honest summary is small and, I think, durable. The barrier to turning analysis into engineering was never only technical — it was also about incentives, and AI lowers both. An analyst can now cross more of that last mile on their own. But the bridge still rests on engineering foundations they didn’t build, and on judgment no model supplied for them.

I Let a Neural Network Do My Feature Engineering — Then Made It Explain Itself

KY John — Sat, 13 Jun 2026 01:35:18 GMT

An experiment in automated feature discovery for credit risk: a sequential model trained on raw payment history, interrogated with explainability tools, and translated into 52 plain tabular features that closed the entire gap to the neural model on holdout. Spread across two calendar days — with the human time measured in hours, not weeks.

The job that eats our weeks

If you build behavioral risk models, you know the drill. The signal lives in history tables — millions of payment events, monthly card balances, bureau records — and the only way to feed that to a scorecard or GBM is to collapse it into aggregates: mean days past due in the last 6 months, max utilization ever, count of late payments in 12.

The pain is combinatorial. Every variable × every window (3/6/12/24 months) × every statistic (mean, max, count, trend) × every condition you can dream up. Thousands of candidates. Weeks of analyst time. And when you ship, a quiet doubt remains: what did we not think of?

So I ran an experiment around a simple inversion: what if the model does the searching, and the analyst only does the reading?

The idea: the network is the search engine, not the product

The plan had three stages, each with a hard, falsifiable criterion:

Train a sequential neural network on near-raw history — no hand-crafted aggregates anywhere in its inputs — and require it to beat a meaningful bar: validation AUC > 0.78. The data: a private competition dataset modeled on the well-known Home Credit Kaggle data — an application table plus six history tables; 246K applications, 8.1% bad (default) rate, 13.6M payment events plus card, point-of-sale and credit bureau tables. (The raw data can’t be redistributed, but the public Kaggle Home Credit dataset has the same shape if you want to reproduce the method.)
Interrogate the frozen model with explainability tooling and write down, in plain language, what it learned.
Rebuild the findings as ordinary tabular features, hand them to a plain LightGBM, and measure how much of the network’s edge they keep — on a holdout set touched exactly once.

The crucial design decision: the neural network was never the deliverable. The deliverable was the feature list. The deployable model at the end is a boring, governance-friendly GBM over named columns — the network exists only to search a hypothesis space no analyst can cover by hand.

One constraint made the whole thing meaningful: the network’s inputs were raw event streams (due date, paid date, amounts, statuses…), with only “unit fixes” added — paid minus due = delay, paid over due = ratio. No windows, no decays, no counts-over-time. If we had fed it engineered aggregates, interrogating it would just have echoed our own assumptions back at us.
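As a rough sketch of what a "unit fix" looks like in code: one derived value per event, with nothing aggregated over time. The field names are hypothetical, not the dataset's actual columns.

```python
from datetime import date

def unit_fixes(event):
    """Add per-event derived channels only: no windows, counts, or decays."""
    delay_days = (event["paid_date"] - event["due_date"]).days
    due = event["due_amount"]
    paid_ratio = event["paid_amount"] / due if due else None  # guard zero due
    return {**event, "delay_days": delay_days, "paid_ratio": paid_ratio}

# Hypothetical payment event.
event = {
    "due_date": date(2025, 3, 10),
    "paid_date": date(2025, 3, 7),
    "due_amount": 200.0,
    "paid_amount": 150.0,
}
fixed = unit_fixes(event)
print(fixed["delay_days"], fixed["paid_ratio"])  # -3 0.75
```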

What actually happened (including the failures)

A disclosure that matters for everything that follows: I directed an AI coding agent (Claude Code) through the entire build — it wrote the pipeline, ran the experiments, and drafted the analyses; my own keyboard time across the two days went into design decisions at the start, a long stream of “explain this to me” challenges in the middle, and reading findings and approving feature names at the end. The honest log matters more than the highlight reel:

The first model failed the gate. It stalled at 0.767 validation AUC — barely above the applications-only baseline of 0.763 (validation). Before burning more GPU time, I had a set of independent AI reviewer agents audit the data pipeline, each instructed to try to refute the others’ bug reports so only solid findings survived. Three silent scaling bugs were confirmed. The worst: a clipping step had collapsed every applicant’s age to the same constant. Age — one of the strongest variables in retail credit — had been deleted from the model, and nothing crashed; the model trained merrily at a plausible-looking AUC. If you take one engineering lesson from this post: scaling bugs don’t announce themselves. Audit the transformed arrays — nunique(), clip-saturation — not just the raw data.

Then the neural fusion head hit a ceiling. With the bugs fixed and standard tricks applied (an auxiliary head forcing the behavioral embedding to be predictive on its own; randomly hiding the application features during training so the model can’t lean on them), the network’s own head still topped out around 0.769 (validation). Neural nets are simply mediocre at flat tabular data, where GBMs shine. The fix was architectural honesty: keep the network as the sequence reader, freeze its 128-number behavioral embedding, and let a LightGBM do the fusion — raw application columns and embedding columns side by side. That, plus a small ensemble of encoder variants, passed the gate at 0.7816 (validation).
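The audit recommended in the scaling-bug lesson above can be sketched in a few lines. This is a plain-Python illustration, assuming columns held as lists; with pandas, `nunique()` would play the role of the distinct-value check. The age-in-days example is a hypothetical reconstruction of that kind of bug, not the actual one from the project.

```python
def audit_columns(columns, clip_lo, clip_hi, max_saturated=0.5):
    """Flag transformed columns that are constant or mostly pinned at a clip bound."""
    findings = {}
    for name, values in columns.items():
        problems = []
        if len(set(values)) <= 1:
            problems.append("constant")
        saturated = sum(v <= clip_lo or v >= clip_hi for v in values) / len(values)
        if saturated > max_saturated:
            problems.append(f"clip-saturated ({saturated:.0%})")
        if problems:
            findings[name] = problems
    return findings

def clip(values, lo, hi):
    return [min(max(v, lo), hi) for v in values]

# Hypothetical bug: age left in raw days, then clipped to [-5, 5] as if it
# had already been standardized. Nothing crashes; the column just goes flat.
age_days = [-9500, -12000, -16000, -21000]
income_z = [-0.4, 0.1, 1.3, -1.1]

transformed = {
    "age": clip(age_days, -5, 5),
    "income": clip(income_z, -5, 5),
}
print(audit_columns(transformed, -5, 5))
# {'age': ['constant', 'clip-saturated (100%)']}
```

The point of running this on the transformed arrays rather than the raw table is that the raw data here looks perfectly healthy.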

The interrogation: making the black box talk

This is the part I’d never done before, and it’s where the experiment earns its keep. Three tools plus one final probe, deliberately different in mechanism, all run on the frozen model:

1. Attribution (Integrated Gradients) decomposes the behavioral score into per-event, per-channel contributions — a points table for a model that has no points table. It told us where the model looks: payment amounts and delays dominate, concentrated in the most recent ~10–15 installments; the monthly view is read as a current portfolio snapshot (the top channel was scheduled future installments — forward commitment burden, not delinquency); bureau signal concentrates in how recently accounts were opened. Best of all, plotting attribution against time produced a forgetting curve per data source — payments decay fast, applications and bureau records slowly. The model had learned different memory lengths for different data, something a uniform “last 12 months” window would have flattened.

2. Counterfactual perturbation is stress-testing aimed at the model: edit real histories surgically, re-score, measure. Injecting a fresh 3-payment late streak: +14.8 percentage points of predicted default probability in the top-risk decile. The same streak placed 18 months back: +0.9. That ratio is a feature specification — recency-discounted delinquency with a half-life around five to six payments, which became a literal column (Σ late × 0.89^k, k = payments back from today). Deleting the older half of someone’s history — mostly clean events — raised their risk: tenure itself is protective. And the null results mattered just as much: chronic mild underpayment, days-past-due (DPD) on monthly statements, utilization trend — all moved the model barely at all. Each null is strong evidence (not proof — the network can miss things too) that a feature family isn’t worth building for this portfolio, with a measured justification attached.

3. Embedding analysis treats the model’s internal customer summary (the encoders’ 128-number embeddings, concatenated) as a map of “behavior space”. Clustering it produced six archetypes whose real default rates ran 2.4% to 25.3% — and here’s the insight I keep retelling: median DPD was ~zero in every archetype. 92% of eventual defaulters had no statement delinquency at all in the prior year. The 10× risk separation came from soft gradients — how early people pay (10 days early vs 5), how deep their history runs, their refusal history. The model invented a segmentation no DPD-based rule could see.

The embedding gave us one final probe, the strangest of the lot: take the model’s main internal risk axis (the combination of embedding numbers that best tracks its score) and regress it on ~23 statistics we could name. They explained 39% of its variance (linear R²). The other 61% was the model knowing something we had no words for — so we pulled the customers at the extremes of that unexplained direction and read their raw files like underwriters. Two patterns emerged that none of our standard vocabulary captured: same-day bursts of refused applications (8–14 refusals, specific rejection codes) and heavy external debt sitting on a thin internal file. Both became features. Both are the kind of thing you find in year three of working a portfolio, not week one.
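The recency-discounted delinquency column from the counterfactual step (Σ late × 0.89^k) is simple enough to sketch directly. Two assumptions are mine: k = 0 is taken to be the most recent payment, and 18 payments stand in for "18 months back" on a monthly schedule.

```python
import math

DECAY = 0.89  # per-payment decay quoted in the text

def decayed_late_score(late_flags, decay=DECAY):
    """Sum of late * decay**k, where k counts payments back from the newest.

    late_flags is a list ordered oldest to newest; 1 means the payment was late.
    """
    newest_first = reversed(late_flags)
    return sum(flag * decay ** k for k, flag in enumerate(newest_first))

# Number of payments after which a late event's weight has halved.
half_life = math.log(0.5) / math.log(DECAY)

clean = [0] * 18
fresh_streak = clean + [1, 1, 1]   # three late payments, just now
old_streak = [1, 1, 1] + clean     # the same streak, 18 payments back

print(round(half_life, 2))                          # 5.95
print(round(decayed_late_score(fresh_streak), 3))   # 2.682
print(decayed_late_score(old_streak) < decayed_late_score(fresh_streak))  # True
```

A half-life just under six payments is consistent with the "five to six payments" read off the perturbation results.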

The verdict

Each finding became one or more named features — 52 in total, every one with a provenance note linking it to the finding that motivated it, implemented in plain pandas with no neural network anywhere at inference. Then the only test that matters, on the untouched holdout (holdout numbers run a little below the validation figures quoted above — expected, and exactly why the holdout exists):

Gap closure: 102% on the point estimates — read that as a statistical tie: the 0.0006 AUC by which the features “beat” the network is well within noise on a single holdout, and I make nothing of it. The claim that matters is that the interpretable features kept essentially everything the network found (a second check agrees: the features-alone model matches the network’s behavior-only score head almost exactly). And no, that doesn’t make the network redundant: it found the features. Without it, the same list would have required exactly the manual search this experiment set out to replace.
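How "gap closure" is computed is not spelled out here. One plausible reading, stated as an assumption, is the share of the network's AUC edge over a common baseline that the tabular features retain:

$$\text{gap closure} = \frac{\mathrm{AUC}_{\text{features}} - \mathrm{AUC}_{\text{baseline}}}{\mathrm{AUC}_{\text{network}} - \mathrm{AUC}_{\text{baseline}}}$$

Under that reading, a value above 100% only means the feature model's point estimate landed slightly above the network's, which is the statistical tie described above.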

Why I think this changes the feature-development workflow

Time. Two calendar days end-to-end, and most of that was compute, not human attention. The human’s job is compressed into: design decisions at the start, reading findings in the middle, naming features at the end.

Features arrive with evidence, not hunches. Every feature came with a measured functional form (the decay constant was fitted by the model and read off a dose-response curve, not guessed), a confirmed direction (useful for monotonicity constraints and model documentation), and — the underrated half — licensed omissions: measured evidence that the network, which saw everything, found no use for certain feature families, so nobody burns a sprint building them “just in case”.

Insights are a first-class output, not a by-product. The defaulters-look-clean finding, the archetype segmentation, the credit-appetite signal in bureau account openings — those are portfolio knowledge with uses far beyond one model: strategy, monitoring, collections prioritization. Manual feature engineering produces features; this produces features and an explanation of the portfolio.

The deployable artifact stays boring. Nothing about production changes: a GBM, named columns, documented provenance. For anyone working under model governance, that’s the difference between “interesting research” and “something I can actually ship”.

What this is not

Honesty section. The method is not human-free: someone chooses the raw representation, reads the extreme cases, and names the features — that’s the point, not a flaw (translation into human language requires a human). The perturbation effects are within-model sensitivities, not causal claims about customers. The probe is linear, so “61% unexplained” is an upper bound on mystery — some of it is interactions of known quantities. This is one dataset, one domain; the playbook needs adapting for very rare outcomes, drifting populations (use out-of-time holdouts!), or domains where the risk signal is what stopped happening. And it needs label volume — supervised embeddings want thousands of positives.

Also: I ran the interrogation as a single pass because that was the experiment’s design. In practice you’d loop — generate features, measure the remaining gap to the network, re-interrogate targeted at the cases where the features and the network disagree most, and repeat until converged. The disagreement cases are exactly where the unextracted signal hides.

Try it

The code and analysis are public: the full pipeline, the interrogation scripts, the findings documents with per-feature provenance, a tutorial written for risk professionals with no neural-network background, and a generalized, self-contained playbook for applying the method to other domains (churn, fraud, claims). The raw dataset itself can’t be redistributed — but the public Kaggle Home Credit data has the same shape if you want to run the method end to end:

https://github.com/kychanbp/auto-feature-gen-with-nn-interogation

My takeaway after two days: feature engineering didn’t disappear. It moved up a level — from writing thousands of candidate aggregates to reading what a model that saw everything chose to care about. I suspect that inversion is where the interesting work is moving.

I knew the rules of Go. Writing them for a machine taught me I didn't understand them.

KY John — Sat, 30 May 2026 02:00:50 GMT

I’m rebuilding AlphaGo from scratch — 9×9 board, no shortcuts — to actually understand how it works instead of just running someone else’s repo. The spark was this podcast, which walks through building AlphaGo from scratch and makes the whole thing feel approachable enough to actually try. Stage 1 was the unglamorous part: the Go engine. Place a stone, capture dead groups, and score the board. I know the rules of Go, so I figured this would be a quick warm-up.

It was not. And the reason it wasn’t is the whole point of this post.

The machine takes nothing for granted

Knowing the rules of Go is enough to follow a game: “these stones are captured,” “that’s obviously territory,” “you can’t replay there, that’s ko.” But knowing a rule well enough to follow it lets you lean on intuition for the messy parts — you recognize the situations without ever pinning down the exact mechanics, because as a human, you never have to.

A computer has none of that slack. To write the engine, I had to turn every piece of “obvious” knowledge into a rule precise enough that a machine with zero common sense could follow it. That process is where I discovered how little I actually understood the rules I thought I knew.

Here are the moments that humbled me.

Pitfall 1: the bug hid exactly where I didn’t look

My first function just listed a point’s neighbors. I wrote it, tested the top-left corner, an edge, and the center — all correct. Moved on.

Then a test of the bottom-right corner returned neighbors that were off the board. The bug: I’d written r + 1 <= size instead of r + 1 < size — a classic off-by-one. My earlier tests had all exercised the top and left edges, where a different check fires. The bug lived in the one region my tests never touched.
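For reference, a minimal sketch of a neighbor function with the corrected bounds. It is not the engine's actual code; the `(row, col)` tuple representation and the `size` default are assumptions.

```python
def neighbors(r, c, size=9):
    """Return the on-board orthogonal neighbors of point (r, c).

    Valid indices run 0..size-1, so the upper-bound checks must be strict:
    r + 1 < size, not r + 1 <= size.
    """
    result = []
    if r - 1 >= 0:
        result.append((r - 1, c))
    if r + 1 < size:
        result.append((r + 1, c))
    if c - 1 >= 0:
        result.append((r, c - 1))
    if c + 1 < size:
        result.append((r, c + 1))
    return result

print(neighbors(0, 0))  # [(1, 0), (0, 1)]
print(neighbors(8, 8))  # [(7, 8), (8, 7)]
```

Testing all four corners, not just the top-left one, is what exercises both the lower-bound and the upper-bound checks.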

Lesson, burned in permanently: a passing test suite doesn’t mean correct code — it means correct code in the cases you tested. Bugs hide in the gaps.

Pitfall 2: the rule I knew, in an order I never questioned

Anyone who knows Go “knows” you remove stones that have no liberties. But here’s a question I’d never had to answer: when you place a stone, what order do you resolve things in?

It turns out the order is load-bearing. You must remove the opponent’s dead stones before checking whether your own move was suicide. Why? Because a move that looks like suicide — your stone dropping into a spot surrounded by enemies — can actually be a capture. Removing the opponent first frees up the very liberty that saves your stone.
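The ordering can be sketched as follows. This is an illustrative reconstruction under my own assumptions (board as a list of lists, `play` returning a new board or `None`), not the engine described in this post, and it leaves ko out entirely.

```python
EMPTY, BLACK, WHITE = ".", "B", "W"

def neighbors(r, c, size):
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(y, x) for y, x in cand if 0 <= y < size and 0 <= x < size]

def group_and_liberties(board, r, c):
    """Flood-fill the group containing (r, c); return (stones, liberties)."""
    size, color = len(board), board[r][c]
    stones, liberties, stack = {(r, c)}, set(), [(r, c)]
    while stack:
        y, x = stack.pop()
        for ny, nx in neighbors(y, x, size):
            if board[ny][nx] == EMPTY:
                liberties.add((ny, nx))
            elif board[ny][nx] == color and (ny, nx) not in stones:
                stones.add((ny, nx))
                stack.append((ny, nx))
    return stones, liberties

def play(board, r, c, color):
    """Return the board after the move, or None if the move is illegal."""
    if board[r][c] != EMPTY:
        return None
    new = [row[:] for row in board]
    new[r][c] = color
    opponent = WHITE if color == BLACK else BLACK
    # Step 1: remove adjacent opponent groups left without liberties.
    for ny, nx in neighbors(r, c, len(new)):
        if new[ny][nx] == opponent:
            stones, libs = group_and_liberties(new, ny, nx)
            if not libs:
                for y, x in stones:
                    new[y][x] = EMPTY
    # Step 2: only now check our own group; no liberties means suicide.
    _, libs = group_and_liberties(new, r, c)
    return new if libs else None

# Black at (1, 2) is surrounded by white, but white (1, 1) is in atari.
board = [row.split() for row in [
    ". B W . .",
    "B W . W .",
    ". B W . .",
    ". . . . .",
    ". . . . .",
]]
after = play(board, 1, 2, BLACK)
print(after[1][1], after[1][2])  # . B

# A genuine suicide: nothing is captured, so the move is rejected.
corner = [row.split() for row in [". W .", "W . .", ". . ."]]
print(play(corner, 0, 0, BLACK))  # None
```

Swapping steps 1 and 2 would reject the first move as suicide, which is exactly the mistake the ordering rule prevents.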

I knew the capture rule cold — but I’d never had to ask in what order captures and suicide resolve, because when you’re just following along, the board sorts itself out and you never notice the sequence. Writing it down for the machine forced the rule into focus.

Pitfall 3: the simple rule that wasn’t — ko

I knew ko: you can’t immediately recapture and put the board back the way it was. Simple.

Except the engine needs something stronger — positional superko: you can’t recreate any board position that has ever occurred in the game, not just the last one. And when I asked why, the answer reframed how I think about the whole project: self-play games must be guaranteed to terminate. AlphaGo trains by playing millions of games and scoring the result. A game that loops forever can never be scored. Simple ko only stops the 2-move loop; longer cycles (triple ko and friends) would still hang. Superko forbids all repetition, so every game ends.
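A minimal sketch of the bookkeeping positional superko needs, assuming the board is a list of lists. `PositionHistory` and `board_key` are hypothetical names, not the engine's.

```python
def board_key(board):
    """Hashable snapshot of a position (board is a list of lists)."""
    return tuple(tuple(row) for row in board)

class PositionHistory:
    """Records every position that has occurred, for positional superko."""

    def __init__(self, initial_board):
        self.seen = {board_key(initial_board)}

    def would_repeat(self, candidate_board):
        """True if playing into this position would recreate an earlier one."""
        return board_key(candidate_board) in self.seen

    def record(self, board):
        """Must be called after every committed move."""
        self.seen.add(board_key(board))

# Toy 2x2 positions, only to exercise the bookkeeping.
history = PositionHistory([[".", "."], [".", "."]])
pos_a = [["B", "."], [".", "."]]
pos_b = [["B", "W"], [".", "."]]
history.record(pos_a)
history.record(pos_b)

print(history.would_repeat(pos_a))                       # True
print(history.would_repeat([[".", "W"], [".", "."]]))    # False
```

The check only works if every position passes through `record`; a test that edits the board directly bypasses it, so the history never sees those positions.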

I knew the rule. I’d never known what it was for.

(Bonus humbling: my first superko test “proved” the feature was broken. The board kept repeating. The bug wasn’t in the engine — it was in my test: I’d built the position by poking the board directly, which skipped the bookkeeping that records past positions. The engine was fine; I’d lied to it. Even your tests can be wrong.)

Pitfall 4: three functions, three jobs, and the danger of blur

The trickiest stretch was a refactor: I needed one function to mutate the board, another to check a move’s legality without changing anything, and a third to commit a real move. Three jobs.

I kept letting them bleed into each other. One version had the “check” function quietly committing moves, so generating the list of legal moves would have played 80 phantom moves. Another had the commit function silently failing to place the stone at all. Each bug came from the same root: a function doing more than its one job.
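One way to keep the three jobs apart, sketched under simplifying assumptions: captures, suicide, and ko are omitted, so "legal" here only means "the point is empty". The class and method names are mine, not the engine's.

```python
class Game:
    def __init__(self, size=9):
        self.board = [["."] * size for _ in range(size)]
        self.to_play = "B"

    def _place(self, board, r, c, color):
        """Job 1: mutate the board it is given (captures omitted in this sketch)."""
        board[r][c] = color

    def is_legal(self, r, c):
        """Job 2: answer a question using a scratch copy; self.board is untouched."""
        if self.board[r][c] != ".":
            return False
        scratch = [row[:] for row in self.board]
        self._place(scratch, r, c, self.to_play)
        return True  # a full engine would run suicide and ko checks on scratch

    def commit(self, r, c):
        """Job 3: apply a real move to self.board and pass the turn."""
        if not self.is_legal(r, c):
            raise ValueError("illegal move")
        self._place(self.board, r, c, self.to_play)
        self.to_play = "W" if self.to_play == "B" else "B"

    def legal_moves(self):
        size = len(self.board)
        return [(r, c) for r in range(size) for c in range(size) if self.is_legal(r, c)]

game = Game()
game.commit(4, 4)
moves = game.legal_moves()
stones = sum(cell != "." for row in game.board for cell in row)
print(len(moves), stones)  # 80 1
```

Listing 80 legal moves leaves exactly one stone on the board, because only `commit` is allowed to touch `self.board`.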

This is the kind of clean separation an AI would write correctly on the first try. And that’s precisely why I’m glad I wrote it myself — badly, then fixed it. Reasoning through why the board stopped updating taught me something about side effects and single-responsibility that a correct-on-the-first-try answer never would have.

The actual lesson: AI can write the code. It can’t transfer the understanding.

Through all of this, I had an AI tutor — but deliberately not as a code generator. It refused to hand me answers. It asked leading questions, reviewed what I wrote, pointed out the region my tests didn’t cover, and made me predict every output before running it. I found or fixed every bug above myself.

I could have typed “write me a 9×9 Go engine” and had a working one in seconds. I’d also have understood exactly nothing. The understanding doesn’t live in the code — it lives in the struggle of producing the code: choosing the data structures, predicting outputs, and especially debugging. Debugging is where the abstract rule collides with a concrete failure, and the lesson finally lands.

And the sharpest version of this, for me: implementing the rules of Go taught me the rules of Go — at a precision that simply knowing them never required. Knowing a rule well enough to follow it and understanding it well enough to build it are different kinds of knowledge, and only the second one transfers to the next problem.

That’s the whole reason I’m building this by hand. AlphaGo’s real magic — the search, the self-improving network — is still ahead of me (Stage 2 is Monte Carlo Tree Search). But if Stage 1 taught me anything, it’s that the value isn’t in having the engine. It’s in becoming the kind of person who could have written it.

Next up: teaching a computer to search — without a neural net yet.

A note on process: this post was drafted by AI and reviewed, corrected, and signed off by me, which, given the argument above, feels worth saying out loud. The code, the bugs, and the understanding are mine; the write-up is a collaboration. Using AI to help narrate the journey is fine. Using it to skip the journey would have defeated the entire point.

Why Singapore has no policy rate, the yen collapsed, and Brazil paid you 14% to hold a rising currency?

KY John — Sun, 24 May 2026 08:55:50 GMT

Exchange rates and interest rates shape our everyday lives. The exchange rate affects our purchasing power; the interest rate affects borrowing costs and asset prices. If you earn Singapore dollars, a trip to Japan looks attractive. If you buy a house in Singapore, you’ll notice the mortgage rate moves in line with rates in the United States. Yet most of us don’t know the mechanism underneath — how these variables move and interact.

The Impossible Trilemma

The impossible trilemma states that a country can have only two of three things: control of its interest rate, control of its exchange rate, and free movement of capital.

When capital flows freely, the no-arbitrage relationship between interest rates and the exchange rate is:

$$1 + i_d = \frac{F}{S}\,(1 + i_f)$$

where S is the spot rate (domestic currency per unit of foreign), F the forward rate, i_d the domestic interest rate, and i_f the foreign rate.

The formula says: holding one unit of domestic currency (which grows to 1+i_d) must equal converting it to foreign currency at the spot (1/S), earning the foreign rate (1+i_f), and converting it back through a forward contract (F).

The equation is just a constraint — it doesn't say which variable drives the others. That's a choice. A country pins one lever, and the formula then determines the rest.
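A small numeric sketch of the relationship described above, solved for the forward rate. The spot and interest rates are hypothetical numbers chosen for illustration, not market data, and rates are treated as simple rates over the forward's horizon.

```python
def forward_rate(spot, i_domestic, i_foreign):
    """Covered interest parity solved for F: F = S * (1 + i_d) / (1 + i_f).

    spot and the result are quoted as domestic currency per unit of foreign.
    """
    return spot * (1 + i_domestic) / (1 + i_foreign)

# Hypothetical inputs: domestic rate below the foreign rate.
spot = 1.30
i_d, i_f = 0.02, 0.04
fwd = forward_rate(spot, i_d, i_f)

# Two routes for one unit of domestic currency end at the same value.
stay_home = 1 + i_d
round_trip = (1 / spot) * (1 + i_f) * fwd

print(round(fwd, 4))                         # 1.275
print(abs(stay_home - round_trip) < 1e-12)   # True
```

With the domestic rate below the foreign rate, F comes out below S: the domestic currency trades at a forward premium.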

Take Singapore. The Monetary Authority of Singapore (MAS) uses the exchange rate, not the interest rate, as its main policy tool: it manages the Singapore dollar against a trade-weighted basket (the S$NEER) within a band, usually on a gently appreciating path. When the market expects the Singapore dollar to strengthen, the formula does the rest: F < S, so F/S < 1, which pushes SGD interest rates below foreign rates. The MAS never sets the domestic rate directly; it falls out of the currency policy.

But here’s the crucial limit: this formula pins down the forward rate, not the future spot rate. The forward is locked in by covered arbitrage; where the spot actually goes depends on expectations, risk premia, capital flows, and policy credibility.

Japan makes the point. Its interest rate is far below the United States’. The formula puts the yen at a forward premium — so you might expect it to strengthen. In reality, the yen depreciates.

Covered and Uncovered Interest Parity

The formula above is covered interest parity — “covered” because you hedge the currency risk with a forward. It’s a no-arbitrage benchmark, and in deep, liquid markets it holds very closely. (It can drift only when funding stress or balance-sheet constraints open a “cross-currency basis,” as they have at times since 2008.) But notice what it does and doesn’t say: it pins the forward price to the interest-rate gap. It says nothing about where the spot rate will actually go.

That second question belongs to uncovered interest parity — the unhedged version. In theory, a low-yielding currency should appreciate just enough to offset its lower interest rate. In practice this often fails, because investors demand risk premia, chase carry, or react to safe-haven and policy shifts.
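In the same notation as covered interest parity, the uncovered version replaces the forward rate with the expected future spot rate (the symbol E[S_{t+1}] is my addition):

$$\frac{E[S_{t+1}]}{S_t} = \frac{1 + i_d}{1 + i_f}$$

With i_d below i_f, the right-hand side is less than 1, so S is expected to fall: one unit of foreign currency should buy less domestic currency later, meaning the low-yielding domestic currency appreciates. That is the prediction that often fails in practice.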

Japan is the clearest case. Its low rate puts the yen at a forward premium, but in the spot market traders borrow cheap yen to buy higher-yielding foreign assets, creating selling pressure that pushes the yen down. Switzerland shows the other side: despite equally low rates, the franc tends to appreciate, because it enjoys safe-haven demand.

Notice the pattern: the same low interest rate sends the yen down and the franc up. The interest rate alone never tells you which way a currency moves — what matters is which lever the country chose to control, and what it leaves to the market.

Around the World

China has restricted the free flow of capital, which lets it steer both the interest rate and the exchange rate more independently — because the arbitrage channel that would otherwise link them is constrained. The link isn’t fully severed (trade flows, approved investment channels, and the offshore CNH market still matter), but it’s loose enough that domestic rates and the yuan can diverge from what open-market arbitrage would dictate.

Taiwan faces recurring appreciation pressure from its strong external position and capital inflows. To lean against excessive TWD appreciation and protect export competitiveness, its central bank buys foreign currency and accumulates reserves (now over US$600 billion), while using monetary and prudential tools to manage domestic liquidity.

Brazil, as of 2026, shows the opposite of Japan. With a high policy rate, the real’s forwards imply depreciation under interest parity — yet the spot real has appreciated, because high carry, improved sentiment, and capital inflows more than offset the forward-implied decline.

Closing

Three levers — interest rate, exchange rate, free capital — and you can pin only two. The third is whatever the market makes of it. So the next time your Singapore mortgage tracks the Fed, or your yen holiday gets cheaper, you’re watching the same equation at work — quietly linking interest rates, exchange rates, and the flow of capital across the world.

What's Really Going On in Machine Learning? A Simplest Example by Stephen Wolfram

KY John — Sat, 16 May 2026 02:01:07 GMT

In a Neural Net (NN)^[1], the standard setup includes an input layer, hidden layers that sit between the input and output layer, and neurons that contain the weights, a bias, and a reference to the activation function. The architecture defines the computation graph (visually, that is, the connections between nodes); the forward pass evaluates that graph for a given input. Back-propagation computes gradients of the loss with respect to the parameters; an optimizer such as SGD or Adam then updates the weights.

To understand it, we can simplify as many components as possible. First, let’s restrict connections to local neighbors only, as in a mesh neural net. Second, let’s make everything discrete instead of continuous. The simplest discrete analog is a rule array: a cellular-automaton-like grid in which each cell independently picks which rule to apply from a small fixed panel (see Appendix for an introduction to cellular automata). The value of a cell is discrete, either 1 or 0, and is determined by a rule set applied to the left, center, and right cells in the row above it. By analogy, each cell in the grid is a neuron in the NN, and each row of the grid is a layer. The computation graph is a mesh network that depends only on local neighbors. Random mutation plays the role of the training/optimization procedure, but unlike gradient descent it does not use gradients; it is closer to stochastic hill climbing or evolutionary search. The value of a cell corresponds to the activation, the value a neuron produces after the forward-pass computation.

Implementation

First, we define the grid size to be $T \times W$, where $T$ is the number of time steps and $W$ is the width of the grid.

Second, we define a function to generate the rule set, converting decimal to binary (see Appendix for details).

def rule_to_lut(n):
    """Convert rule number to a lookup table."""
    return [ (n // 2**i) % 2 for i in range(8) ]

We will mainly use rule 4 and rule 146. The lookup tables are:
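
They can be regenerated with the function above. This is a minimal sketch: the names rule_4_lut and rule_146_lut are the ones the evolve function further down refers to, and entry idx of each table corresponds to the neighborhood left * 4 + center * 2 + right.

```python
def rule_to_lut(n):
    """Convert rule number to a lookup table (same logic as above)."""
    return [(n // 2**i) % 2 for i in range(8)]


rule_4_lut = rule_to_lut(4)
rule_146_lut = rule_to_lut(146)

print(rule_4_lut)    # [0, 0, 1, 0, 0, 0, 0, 0]
print(rule_146_lut)  # [0, 1, 0, 0, 1, 0, 0, 1]
```

Read this way, rule 4 keeps a cell alive only for the neighborhood 010, while rule 146 turns a cell on for 001, 100, and 111.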

Third, we construct a rule array that indicates whether to apply rule 4 or rule 146 to each cell, and initialize it to zeros.

rule_array = [[0] * W for _ in range(T)]

Fourth, we define a function that evolves the grid one row at a time, starting with the initial configuration [0000...010...0000]. We loop through the cells from top to bottom and left to right, and apply the corresponding rule set (Rule 4 or Rule 146) based on the value in the rule array (0 or 1).

def evolve(rule_array, initial_row):
    T = len(rule_array)
    W = len(initial_row)
    pat = [list(initial_row)]  # start with the initial row
    for t in range(T):  # loop through time steps
        new_row = []
        for w in range(W):  # loop through cell width
            left = pat[t][w-1] if w-1 >= 0 else pat[t][W-1]  # if hitting left wall, treat as the rightmost cell
            center = pat[t][w]
            right = pat[t][w+1] if w+1 < W else pat[t][0]  # if hitting right wall, treat as the leftmost cell
            idx = left * 4 + center * 2 + right  # convert binary to decimal index

            if rule_array[t][w] == 0:
                new_row.append(rule_4_lut[idx])  # apply rule 4
            else:
                new_row.append(rule_146_lut[idx])  # apply rule 146
        pat.append(new_row)

    return pat

Fifth, we define our loss function. Our goal is a cellular automaton whose pattern dies out at a chosen lifetime, say 30. The lifetime is the index of the last nonzero row. The loss is just the absolute difference between the target and the lifetime.

def lifetime(pat):
    """Calculate the lifetime of the pattern."""

    # Scan from the last row backwards: stop=-1 means index 0 is the last row visited, step=-1 means go backwards
    for t in range(len(pat) - 1, -1, -1):
        if any(pat[t]):
            return t
    return 0

def loss(rule_array, target):
    # initial_row is read from module scope: W zeros with a single 1 in the middle
    pat = evolve(rule_array, initial_row)
    return abs(lifetime(pat) - target)

Finally, we write the training loop that mutates the rule array, similar to how training finds the weights of an NN. We start from a random rule_array, flip one randomly chosen bit at a time, and keep the flip only if the new loss is less than or equal to the current loss. We use <= instead of < because accepting moves with the same loss lets the search drift across plateaus instead of getting stuck on one configuration (set of weights).

import random

def train(target, max_steps=10000):
    # T and W are the module-level grid dimensions (time steps and width)
    random_rule_array = [[random.choice([0, 1]) for _ in range(W)] for _ in range(T)]  # Initialize a random rule array
    current_loss = loss(random_rule_array, target)
    loss_history = [current_loss]

    for step in range(max_steps):
        # Randomly flip a bit in the rule array
        t = random.randint(0, T-1)
        w = random.randint(0, W-1)
        random_rule_array[t][w] = 1 - random_rule_array[t][w]  # 1-0 = 1 and 1-1 = 0, so this effectively flips the bit

        new_loss = loss(random_rule_array, target)

        if new_loss == 0:
            current_loss = new_loss
            loss_history.append(current_loss)
            break
        elif new_loss <= current_loss:
            current_loss = new_loss
            loss_history.append(current_loss)
        else:  # Revert the change if it doesn't improve the loss
            random_rule_array[t][w] = 1 - random_rule_array[t][w]
            loss_history.append(current_loss)

    return random_rule_array, loss_history

Results and Implications

The simplified NN construction did “learn.” However, the learned solutions show no obvious pattern: some combination of rules happens to minimize the loss function. In fact, a far simpler solution exists: apply rule 4 everywhere except for a single rule-146 cell in row 30, in the column of the surviving black cell.
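
The hand-built solution can be checked directly. The sketch below is a compact, self-contained restatement of the evolve and lifetime logic from the Implementation section; the 60 × 60 grid size and the helper name LUTS are assumptions made for this illustration.

```python
def rule_to_lut(n):
    return [(n // 2**i) % 2 for i in range(8)]


LUTS = {0: rule_to_lut(4), 1: rule_to_lut(146)}  # 0 -> rule 4, 1 -> rule 146


def evolve(rule_array, initial_row):
    """Apply the per-cell rule choice row by row on a cyclic grid."""
    W = len(initial_row)
    pat = [list(initial_row)]
    for rules in rule_array:
        prev = pat[-1]
        # prev[w - 1] wraps to the last cell when w == 0
        pat.append([
            LUTS[rules[w]][prev[w - 1] * 4 + prev[w] * 2 + prev[(w + 1) % W]]
            for w in range(W)
        ])
    return pat


def lifetime(pat):
    """Index of the last row that still contains a 1 (0 if none)."""
    return max((t for t, row in enumerate(pat) if any(row)), default=0)


T, W, target = 60, 60, 30          # grid size assumed for illustration
initial_row = [0] * W
initial_row[W // 2] = 1            # single black cell in the middle

rule_array = [[0] * W for _ in range(T)]   # rule 4 everywhere...
rule_array[target][W // 2] = 1             # ...except one rule-146 cell

print(lifetime(evolve(rule_array, initial_row)))  # 30
```

Under rule 4 a lone black cell simply copies itself downward (neighborhood 010 maps to 1). The one rule-146 cell maps 010 to 0, so the pattern dies exactly one row after row 30.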

The main messages from Stephen Wolfram’s book are:

Many trained systems do not work by discovering clean, human-readable mechanisms. Training is an adaptive process that searches an enormous computational space and retains any behavior that happens to align with the constraints.
ML works because of computational irreducibility (a concept introduced by Stephen Wolfram in A New Kind of Science): the behavior of such systems cannot be shortcut with a closed-form theory. ML exploits this by searching the vast, irreducible computational space, which almost always contains some computation whose dynamics satisfy the need.
There is a trade-off between explainability (for example, a closed-form solution) and performance: if we require explainability, we restrict the search within the computationally irreducible space.

Appendix: Cellular Automata

A one-dimensional cellular automaton (CA) with two states starts with an array of ones and zeros. The next generation (the next row) is generated by a set of rules that take a cell and its two adjacent cells as inputs and output the cell’s next value.

Since there are $2^3 = 8$ combinations of inputs, the rule set can be expressed in 8 binary digits. For example, binary 00011110 = decimal 30; this is called Rule 30 (see Appendix: Base Conversion for more information). Therefore, there are, in total, $2^8 = 256$ possible rule sets.

For example, the following is the space-time pattern^[2] of Rule 30:
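
A rough text rendering of that pattern can be produced with a few lines of Python. This is a minimal sketch: the function name run_ca, the width of 31, and the 15 steps are arbitrary choices for illustration, and the edges are treated as cyclic.

```python
def rule_to_lut(n):
    return [(n // 2**i) % 2 for i in range(8)]


def run_ca(rule, width=31, steps=15):
    """Evolve one elementary CA from a single centered 1 (cyclic edges)."""
    lut = rule_to_lut(rule)
    row = [0] * width
    row[width // 2] = 1
    rows = [row]
    for _ in range(steps):
        row = [lut[row[w - 1] * 4 + row[w] * 2 + row[(w + 1) % width]]
               for w in range(width)]
        rows.append(row)
    return rows


pattern = run_ca(30)
for r in pattern:
    print("".join("#" if c else "." for c in r))
```

Each printed line is one time step, with time flowing downward. The familiar Rule 30 triangle, regular on its left edge and irregular on its right, is already visible after a dozen rows.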

Appendix: Base Conversion

In Python, the conversion can be achieved in one line by answering, for each $i$, whether $2^i$ is in the sum. For Rule 30 (binary 00011110), $2^1$ is in the sum while $2^0$ is not. It is easier to see in base-10 first. To answer how many tens are in a decimal number $n$ (in other words, what the tens digit of $n$ is), we can first chop off the last digit (mathematically, $\lfloor n / 10 \rfloor$^[4]) and look at the rightmost digit of what’s left (mathematically, $\lfloor n / 10 \rfloor \bmod 10$^[5]). Applying the same logic to base-2 for bit $i$ of the number $n$, it becomes $\lfloor n / 2^i \rfloor \bmod 2$.
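
The same chop-then-look idea, written once for any base, is sketched below. The helper name digit and the example number 146 are illustrative choices, not part of the original code.

```python
def digit(n, i, base):
    """i-th digit of n in the given base, counting from the right (i = 0)."""
    return (n // base**i) % base


print(digit(146, 1, 10))   # tens digit of 146 -> 4

bits = [digit(30, i, 2) for i in range(8)]
print(bits)                # [0, 1, 1, 1, 1, 0, 0, 0]
```

The list is ordered from the lowest bit upward, so reading it from right to left gives the familiar 00011110 for Rule 30.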

Appendix: Code

Polished by Claude:

"""
Wolfram §5 replication — train a rule array to a target lifetime.

Replicates the lifetime-training experiment from Stephen Wolfram,
*What's Really Going On in Machine Learning? Some Minimal Models* (2025),
Section 5 ("Machine Learning in Discrete Rule Arrays").

Each cell of a T x W rule array selects between two elementary cellular
automata (rule 4, decay; rule 146, chaotic) to apply at that (time, space)
position. From a single black cell in row 0, the pattern evolves forward T
steps. Training searches for a rule array whose pattern survives EXACTLY
`target` steps, using single-point mutation hill-climbing with the rule
"accept iff new loss <= old loss". The equality is what lets the search
drift across plateaus until it stumbles into a downhill move.

Phenomena reproduced:
  1. Random mutation actually trains.
  2. Trained rule arrays look like noise (no obvious mechanism).
  3. Different runs find different solutions.
  4. Plateau-and-breakthrough learning curves.
  5. Engineered solutions exist but training does not discover them.

Outputs (under --out-dir, default = script directory):
  - learning_curves.png        loss vs. mutation step, all runs overlaid
  - trained_runs.png           trained rule arrays + spacetime patterns
  - trained_spacetime.gif      animated row-by-row evolution

Usage:
    python wolfram_lifetime.py
    python wolfram_lifetime.py --t 30 --w 30 --target 15 --max-steps 10000
    python wolfram_lifetime.py --no-gif
"""

from __future__ import annotations

import argparse
import random
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter


# -----------------------------------------------------------------------------
# Elementary cellular automata
# -----------------------------------------------------------------------------

def rule_to_lut(n: int) -> list[int]:
    """8-entry lookup table for an elementary (k=2, r=1) CA.
    Indexed by (left << 2) | (center << 1) | right."""
    return [(n >> i) & 1 for i in range(8)]


RULE_4_LUT = rule_to_lut(4)      # class 2 -- decay
RULE_146_LUT = rule_to_lut(146)  # class 3 -- chaotic


def evolve(rule_array: list[list[int]], initial_row: list[int]) -> list[list[int]]:
    """Run the 1D CA forward, selecting rule 4 or rule 146 per cell.

    `rule_array[t][w]` selects which rule applies at time t, column w
    (0 -> rule 4, 1 -> rule 146). Width is cyclic.

    Returns the spacetime pattern: a (T+1) x W list of lists.
    """
    T = len(rule_array)
    W = len(initial_row)
    pat = [list(initial_row)]
    for t in range(T):
        prev = pat[t]
        new_row = [0] * W
        for w in range(W):
            # cyclic neighborhood: modulo handles both walls
            idx = (prev[(w - 1) % W] << 2) | (prev[w] << 1) | prev[(w + 1) % W]
            lut = RULE_4_LUT if rule_array[t][w] == 0 else RULE_146_LUT
            new_row[w] = lut[idx]
        pat.append(new_row)
    return pat


def lifetime(pat: list[list[int]]) -> int:
    """Largest row index containing any black cell. 0 if dead from the start."""
    for t in range(len(pat) - 1, -1, -1):
        if any(pat[t]):
            return t
    return 0


# -----------------------------------------------------------------------------
# Training: single-point mutation hill climbing
# -----------------------------------------------------------------------------

def loss(rule_array: list[list[int]], initial_row: list[int], target: int) -> int:
    return abs(lifetime(evolve(rule_array, initial_row)) - target)


def train(
    target: int,
    T: int,
    W: int,
    initial_row: list[int],
    max_steps: int = 20_000,
    seed: int | None = None,
) -> tuple[list[list[int]], list[int]]:
    """Hill-climb a rule array to lifetime = target.

    Starts from a random binary rule array of shape (T, W). Each iteration
    flips one random cell; the flip is kept iff `new_loss <= cur_loss` (the
    `<=` is what lets the search cross plateaus). Early-stops on loss == 0.

    Returns (trained_rule_array, loss_history). The history has one entry
    per iteration, so its length equals the number of mutations attempted
    (+ 1 for the initial state).
    """
    rng = random.Random(seed)
    arr = [[rng.randint(0, 1) for _ in range(W)] for _ in range(T)]
    cur_loss = loss(arr, initial_row, target)
    history = [cur_loss]

    for _ in range(max_steps):
        i, j = rng.randrange(T), rng.randrange(W)
        arr[i][j] = 1 - arr[i][j]  # flip
        new_loss = loss(arr, initial_row, target)
        if new_loss <= cur_loss:
            cur_loss = new_loss
        else:
            arr[i][j] = 1 - arr[i][j]  # revert
        history.append(cur_loss)
        if cur_loss == 0:
            break

    return arr, history


# -----------------------------------------------------------------------------
# Visualization
# -----------------------------------------------------------------------------

def plot_learning_curves(runs, target: int, out_path: Path) -> None:
    """Loss vs. mutation step for all runs, overlaid."""
    fig, ax = plt.subplots(figsize=(8, 4))
    for k, (_, hist) in enumerate(runs):
        ax.plot(hist, lw=0.9, alpha=0.85, label=f"run {k+1} ({len(hist)-1} steps)")
    ax.set_xlabel("mutation step")
    ax.set_ylabel(f"loss = |lifetime - {target}|")
    ax.set_title(f"Learning curves - {len(runs)} independent runs, target={target}")
    ax.legend(fontsize=8)
    ax.grid(alpha=0.3)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)


def plot_trained_runs(runs, initial_row, target: int, out_path: Path) -> None:
    """Trained rule arrays (top) and their spacetime patterns (bottom)."""
    n = len(runs)
    fig, axes = plt.subplots(2, n, figsize=(3 * n, 7.2), constrained_layout=True)
    if n == 1:
        axes = axes.reshape(2, 1)

    for k, (arr, hist) in enumerate(runs):
        axes[0, k].imshow(arr, cmap="binary", aspect="equal",
                          interpolation="nearest", vmin=0, vmax=1)
        axes[0, k].set_title(f"run {k+1}: rule array\n({len(hist)-1} mutations)",
                             fontsize=10)
        axes[0, k].set_xticks([]); axes[0, k].set_yticks([])
        if k == 0:
            axes[0, k].set_ylabel("LEARNED\nrule array\n(black=146, white=4)",
                                  fontsize=9)

        pat = evolve(arr, initial_row)
        axes[1, k].imshow(pat, cmap="binary", aspect="equal",
                          interpolation="nearest", vmin=0, vmax=1)
        axes[1, k].axhline(target + 0.5, color="red", lw=1.0, ls="--",
                           alpha=0.8, label=f"target row {target}")
        axes[1, k].set_title(f"spacetime (lifetime {lifetime(pat)})", fontsize=10)
        axes[1, k].set_xticks([]); axes[1, k].set_yticks([])
        if k == 0:
            axes[1, k].set_ylabel("RESULTING\nspacetime\n(time flows down)",
                                  fontsize=9)
            axes[1, k].legend(loc="lower right", fontsize=7)

    fig.suptitle(
        f"Wolfram §5 replication - {n} independent training runs, target lifetime = {target}\n"
        "Each run learns a different rule array; spacetimes are unrelated but all die at the target row.",
        fontsize=11,
    )
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)


def animate_trained_runs(
    runs, initial_row, target: int, out_path: Path, fps: int = 5
) -> None:
    """Animated gif: rule array (top) and spacetime (bottom) reveal in lockstep,
    one row per frame. Rule-array row t produces spacetime row t+1.
    """
    n = len(runs)
    arrs = [np.array(arr) for arr, _ in runs]
    pats = [np.array(evolve(arr, initial_row)) for arr, _ in runs]
    lifetimes = [lifetime(p.tolist()) for p in pats]
    mutations = [len(h) - 1 for _, h in runs]
    n_frames = pats[0].shape[0]  # T + 1

    fig, axes = plt.subplots(2, n, figsize=(3 * n, 6.6))
    if n == 1:
        axes = axes.reshape(2, 1)
    fig.subplots_adjust(top=0.86, bottom=0.04, left=0.06, right=0.98,
                        hspace=0.25, wspace=0.10)

    def render(t):
        for k in range(n):
            ax_top = axes[0, k]
            ax_top.clear()
            ax_top.set_xticks([]); ax_top.set_yticks([])
            ax_top.set_title(
                f"run {k+1}: rule array\n({mutations[k]} mutations)", fontsize=10
            )
            display_arr = np.zeros_like(arrs[k])
            if t > 0:
                display_arr[:t] = arrs[k][:t]
            ax_top.imshow(display_arr, cmap="binary", aspect="equal",
                          interpolation="nearest", vmin=0, vmax=1)
            if k == 0:
                ax_top.set_ylabel("LEARNED\nrule array\n(black=146, white=4)",
                                  fontsize=9)

            ax_bot = axes[1, k]
            ax_bot.clear()
            ax_bot.set_xticks([]); ax_bot.set_yticks([])
            ax_bot.set_title(f"spacetime (lifetime {lifetimes[k]})", fontsize=10)
            display_pat = np.zeros_like(pats[k])
            display_pat[: t + 1] = pats[k][: t + 1]
            ax_bot.imshow(display_pat, cmap="binary", aspect="equal",
                          interpolation="nearest", vmin=0, vmax=1)
            ax_bot.axhline(target + 0.5, color="red", lw=1.0, ls="--",
                           alpha=0.8, label=f"target row {target}")
            if k == 0:
                ax_bot.set_ylabel("RESULTING\nspacetime\n(time flows down)",
                                  fontsize=9)
                ax_bot.legend(loc="lower right", fontsize=7)

        fig.suptitle(
            f"Wolfram §5 replication - {n} independent training runs, "
            f"target lifetime = {target}  -  step {t}/{n_frames - 1}\n"
            "Each run learns a different rule array; spacetimes are unrelated "
            "but all die at the target row.",
            fontsize=11,
            y=0.985,
        )

    anim = FuncAnimation(fig, render, frames=n_frames, interval=1000 // fps)
    anim.save(out_path, writer=PillowWriter(fps=fps))
    plt.close(fig)


# -----------------------------------------------------------------------------
# Main
# -----------------------------------------------------------------------------

def main() -> None:
    ap = argparse.ArgumentParser(
        description="Train a rule array to a target lifetime (Wolfram §5)."
    )
    ap.add_argument("--t", type=int, default=60, help="time steps (rule-array rows)")
    ap.add_argument("--w", type=int, default=60, help="width (rule-array cols)")
    ap.add_argument("--target", type=int, default=30, help="target lifetime")
    ap.add_argument("--max-steps", type=int, default=20_000,
                    help="max mutations per run")
    ap.add_argument("--n-runs", type=int, default=4,
                    help="number of independent training runs")
    ap.add_argument("--seed", type=int, default=7,
                    help="base seed (per-run seeds are derived)")
    ap.add_argument("--out-dir", type=Path,
                    default=Path(__file__).resolve().parent,
                    help="where to save figures")
    ap.add_argument("--no-gif", action="store_true",
                    help="skip the animated gif (faster)")
    args = ap.parse_args()

    if not 0 <= args.target <= args.t:
        raise SystemExit(f"target must be in [0, {args.t}]; got {args.target}")

    initial_row = [0] * args.w
    initial_row[args.w // 2] = 1

    rng = random.Random(args.seed)
    seeds = [rng.randrange(10**9) for _ in range(args.n_runs)]

    print(f"Training {args.n_runs} runs at T={args.t}, W={args.w}, target={args.target}...")
    runs = []
    for k, s in enumerate(seeds):
        arr, hist = train(
            args.target, args.t, args.w, initial_row,
            max_steps=args.max_steps, seed=s,
        )
        lt = lifetime(evolve(arr, initial_row))
        print(f"  run {k+1} seed={s}: final loss={hist[-1]}, "
              f"lifetime={lt}, mutations={len(hist) - 1}")
        runs.append((arr, hist))

    args.out_dir.mkdir(parents=True, exist_ok=True)
    plot_learning_curves(runs, args.target, args.out_dir / "learning_curves.png")
    plot_trained_runs(runs, initial_row, args.target,
                      args.out_dir / "trained_runs.png")
    print(f"Saved learning_curves.png and trained_runs.png to {args.out_dir}")

    if not args.no_gif:
        animate_trained_runs(runs, initial_row, args.target,
                             args.out_dir / "trained_spacetime.gif")
        print(f"Saved trained_spacetime.gif to {args.out_dir}")


if __name__ == "__main__":
    main()

Footnotes

[1] If readers are not familiar with NN, here is a great introduction: 3Blue1Brown — But what is a neural network?.

[2] The column is space, and the row is time.

[3] In the case of a list in Python, we usually write from left to right.

[4] Integer division.

[5] Modulo.

Interchange Fee and the Universal Acceptance of Cards

KY John — Sat, 09 May 2026 08:00:45 GMT

Payment is fundamentally a two-sided market^[1]. Cardholders want more merchants accepting their cards. Merchants want more cardholders. Adding merchants to the network imposes costs on the acquirer^[2] but also benefits the cardholders, and vice versa. In economic terms, these spillovers are called externalities.

We therefore cannot analyze the payments industry by looking at only one side. The total cost of payment equals the sum of the issuer’s cost^[3], the acquirer’s cost, and other frictions. Unfortunately, the cost structure is quite imbalanced for credit payment. For issuers, the marginal cost of processing an additional transaction typically includes the cost of funds, cost of risk^[4], payment fraud, and customer incentives^[5]. For acquirers, the marginal cost is essentially zero once the infrastructure, such as a POS machine, is set up, ignoring transaction fees paid to the card scheme and issuers^[6]. Increasing the number of cardholders benefits acquirers more because the acquirer keeps the MDR^[7] from more transactions without doing anything extra.

Negotiations between issuers and acquirers on cost sharing are infeasible if universal acceptance is the goal. The number of bilateral agreements needed in the network, assuming only pure issuers and acquirers exist and their numbers are the same, is n². Card schemes, such as VISA and Mastercard, internalize externalities and reduce the transaction costs of negotiating bilateral agreements, thereby making universal card acceptance possible. In the court case between National Bancard Corporation (NaBANCO) and VISA in the 1980s^[8], the court concluded that “The IRF^[9] is a mechanism by which VISA ensures the universality of its card, not a price fixing device to squeeze out entrepreneurs,” and “redistribution of revenues or costs is a must for the continued existence of the product.”
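
The count behind that claim is simple arithmetic, sketched here. The n² figure is from the text; the 2n comparison assumes, purely for illustration, that each issuer and each acquirer needs only one agreement with a central scheme.

```python
def bilateral_agreements(n):
    """Every one of n issuers signs with every one of n acquirers."""
    return n * n


def scheme_memberships(n):
    """Illustrative assumption: each issuer and acquirer signs once with a scheme."""
    return 2 * n


for n in (10, 100, 1000):
    print(n, bilateral_agreements(n), scheme_memberships(n))
```

At 100 issuers and 100 acquirers, that is 10,000 bilateral contracts against 200 memberships under the assumed scheme model.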

The court case also addressed the market in which VISA operates. NaBANCO believed there were three distinct markets: card issuing, merchant servicing, and interchange for receivables. VISA believed there was only a single market that included substitutes, such as cash, cheque, ATM cards, Amex, and Diners Club. The market definition dispute matters because interchange costs cannot be viewed solely as merchant costs. If customers can switch among substitutes, charging interchange fees shifts usage across payment instruments rather than simply lowering merchant cost.

Regulators in some markets, including Spain, Australia, the United States, and the European Union, cap interchange fees. The International Center for Law & Economics has published a paper, The Effects of Price Controls on Payment Card Interchange Fees: A Review and Update. The empirical evidence is mixed and institutionally contested. Critics of interchange caps argue that caps reduce issuer revenue, weaken rewards, and lead to incomplete merchant pass-through. Regulators such as the European Commission, however, argue that caps lowered merchant service charges and produced consumer benefits through lower prices or improved retail services.

In Southeast Asia, QR payment has become an important digital payment method. Unlike in China, central banks in the region act as intermediaries, providing national QR infrastructure, effectively setting the interchange fee to zero, and capping the MDR at a very low level. As a result, the inclusion of credit issuers inevitably declines. However, the central banks seem to optimize for payment-rail inclusion more than credit-rail inclusion and are determined to move from a cash economy to a digital economy.

The framework, therefore, predicts which side has a structural advantage in each market. Where MDR and interchange are high, credit issuers can participate profitably in off-us transactions^[10]. Where MDR is compressed and interchange is zero or near zero, acquirers and payment-rail operators may still benefit from transaction volume, but credit issuers face weaker unit economics unless a separate credit monetization layer exists, such as higher consumer interest, merchant-funded BNPL fees, on-us ecosystems, or issuing credit cards.

Footnotes

[1] A two-sided market does not mean that it involves two parties, a buyer and a seller. It means the total volume depends on the number of participants on the other side.

[2] Enabling merchants to accept cards.

[3] Giving buyers the means to send payments, be it debit, credit, cheque, or cash.

[4] Payment methods that involve credit facilities, such as credit cards.

[5] Individuals tend to be more price-sensitive than merchants.

[6] VISA or Mastercard charges acquirers a card scheme fee for using the network and an interchange fee to balance the costs.

[7] Merchant Discount Rate: the fee charged to merchants for processing the payment.

[8] National Bancard Corp. (NaBanco) v. Visa USA, Inc., 779 F. 2d 592 - Court of Appeals, 11th Circuit 1986.

[9] Issuer reimbursement fee: another name for interchange fee.

[10] Transactions in which the merchant is not acquired by the issuer.

What 50 years of Singapore property prices tell us about risk

KY John — Sat, 25 Apr 2026 08:07:08 GMT

[Ideas are mine but analysis and texts are generated by AI]

Imagine two friends, Ahmad and Bao. Both decide to buy a private apartment in Singapore. Same building, same size. Ahmad buys in 1996 Q3. Bao buys in 2004 Q1. They both hold for five years.

Ahmad’s apartment, by 2001, is worth about 30% less than what he paid. He still has his mortgage. He’s underwater for almost a decade.

Bao’s apartment, by 2009, is worth about 45% more than what he paid. He’s the smart investor at every dinner party.

Same asset class. Same country. Same five-year holding period. Wildly different outcomes.

Most people, looking at this, will say “Ahmad got unlucky.” Or “Bao timed the market well.” Both are wrong, in an interesting way. The honest explanation is statistical, and once you see it, you’ll never read a “long-run average return” line in a property advert the same way again.

This piece walks through what 50 years of Singapore property data actually look like, what statisticians call fat tails, and the single most important — and most under-appreciated — fact about long-term property risk.

The “average return” story

Open any property-marketing brochure and you’ll see something like: “Singapore private property has appreciated by an average of 6.4% per year over the past 50 years.”

That’s not wrong. We pulled 50 years of quarterly data from SingStat — the URA Private Residential Property Price Index, from 1975 Q1 to 2025 Q4 — and the average comes out to about +1.57% per quarter, which annualises to roughly +6.4%. So far, so consistent.
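
The annualisation step is a one-liner. This sketch assumes the 1.57% is a simple per-quarter return that compounds over four quarters; the variable names are illustrative.

```python
quarterly_mean = 0.0157  # +1.57% per quarter, the figure quoted above

compounded = (1 + quarterly_mean) ** 4 - 1   # compound four quarters
simple = 4 * quarterly_mean                  # naive sum, for comparison

print(f"compounded: {compounded:.2%}, simple: {simple:.2%}")
# compounded: 6.43%, simple: 6.28%
```

If the 1.57% were instead a mean log return, the annual figure would be exp(4 × 0.0157) − 1, which is slightly higher. Either way it rounds to the same headline number.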

The trouble is that the average is one of the least useful numbers you can compute about property returns. Here’s the actual time series:

The top panel is the price index. The bottom panel is the quarterly log-returns — basically, how much the index moved each quarter. Notice anything?

That bottom chart is not a tame, well-behaved process. It has periods of intense activity (the 1980–81 boom, the 1993–96 boom, the 1997–98 Asian crisis, the 2008 GFC dip and bounce) and long stretches of relative calm. The quarterly returns range from −15.2% (worst, 2009 Q1) to +24.4% (best, 1981 Q1). A single quarter can easily move the price by more than the average annual return.

The average “+6.4%/yr” is technically true and practically misleading. This is what Nassim Taleb calls a Mediocristan-vs-Extremistan problem. We’ll come back to that.

The thing we don’t talk about: kurtosis

Here is the single most useful number for thinking about risk in fat-tailed worlds, and almost nobody outside finance ever encounters it: kurtosis.

(Pronounced “kur-TOH-sis.” Greek.)

It’s a measure of how much of the action is in the extremes versus the middle of a distribution. The bigger it is, the more the rare events dominate.

For a “normal” bell-curve distribution — the kind you learned about in school — kurtosis is 3.

For Singapore private residential property quarterly returns? 6.5.

For Singapore HDB resale flat returns? 18.2.

The HDB number is genuinely shocking. It is a direct symptom of one specific quarter: 1993 Q2, when HDB resale prices jumped by 27% in three months as a speculative boom rolled through public housing. That one observation, three decades later, still accounts for 74% of the total kurtosis of the entire HDB return distribution since 1990.

Translation in plain English: when you compute “the kurtosis of HDB returns” using 35 years of quarterly data, three-quarters of the answer comes from one quarter. The other 142 quarters together explain only the remaining 26% of it.

This is what “fat-tailed” means in practice. A single, unusual event dominates a statistic that’s supposed to summarise 35 years.
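
To see how one observation can dominate, here is a minimal sketch on synthetic numbers (not the SingStat series). It assumes Pearson kurtosis is the fourth central moment over the squared variance, and that the one-observation share is the largest fourth-power deviation divided by the sum of all of them; the article's exact estimator may differ.

```python
def pearson_kurtosis(xs):
    """Fourth central moment divided by squared variance (normal = 3)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2


def max_quartic_share(xs):
    """Share of the summed fourth-power deviations due to the largest one."""
    mean = sum(xs) / len(xs)
    quartics = [(x - mean) ** 4 for x in xs]
    return max(quartics) / sum(quartics)


# Synthetic series: 142 quiet quarters of +/-2% plus one 27% jump.
returns = [0.02, -0.02] * 71 + [0.27]

print(round(pearson_kurtosis(returns), 1))
print(round(max_quartic_share(returns), 2))
```

Because deviations are raised to the fourth power, a single jump roughly ten times larger than a typical quarter outweighs all the other quarters combined. In this toy series almost the entire kurtosis comes from that one point.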

A picture is worth a thousand standard deviations

Here is the same point in chart form. The bars compare four Singapore property indices on four different “fat-tailedness” measures:

The red dashed lines show what a “normal-ish” distribution would look like.

Pearson kurtosis: should be 3 for normal. SG indices range 5.6 to 18.2.
Max quartic share (how much of the kurtosis is driven by one observation): should be about 0.05 for normal. SG indices range 0.31 to 0.74.
Student-T fitted df (a sophisticated fat-tail measure where lower means fatter, with values near 2 indicating the underlying variance estimate is unreliable): SG indices range 2.13 to 2.51.
Hill α tail exponent (lower means heavier tail; below 4 means kurtosis itself is mathematically infinite in the population): SG indices range 2.17 to 2.91.

Across every measure, every SG property index is in the “fat-tailed” zone. None of them resemble the bell curve that most financial planning quietly assumes.

Why this matters, in dollars

Most retirement planning, mortgage stress-testing, and “how much can I afford?” calculations implicitly assume returns behave like a normal distribution. They use words like “expected return ± standard deviation” and reason about probabilities the way you’d reason about coin flips.

In a normal-distribution world, a quarter that’s 5σ off the mean is essentially impossible. About 1 in 1.7 million.
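
That figure can be checked with the standard library. The sketch assumes a two-sided tail, that is, a move of at least 5σ in either direction; the function name is illustrative.

```python
import math


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2))


p5 = two_sided_tail(5)
print(f"P(|Z| > 5) = {p5:.2e}, about 1 in {1 / p5:,.0f}")
```

The result is on the order of 1 in 1.7 million. Counting only the downside tail would halve the probability.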

In a fat-tailed world, what financial textbooks call a “5σ event” can be a 1-in-100 event, or even more frequent. The 1993 Q2 HDB jump is a 7σ event in a normal model. The 1981 Q1 URA spike is a ~5σ event. The 2009 Q1 GFC dip is a ~3.4σ event. We’ve had multiple of these in 50 years, in just one country, in one asset class.

What that actually means if you’re buying a Singapore property today: your downside in any single quarter could be 15% or more, even with no leverage. Your upside in any single quarter could be 25% or more. The “1-in-50-years” event is more like 1-in-10. Standard deviation, as a measure of “how much can this move,” is simply not informative.

But there’s a more important fact than any of these — and it’s the one I think almost nobody knows about.

The real lesson: it’s clustering, not noise

Here is the most underappreciated finding in all of this. The chart below is the headline diagnostic from Nassim Taleb’s Statistical Consequences of Fat Tails. We applied it to Singapore data:

Each panel is one Singapore property index. The blue line shows the kurtosis (fat-tailedness) of returns aggregated over different time windows — 1 quarter, 2 quarters, all the way to 12 quarters (3 years). The green dashed line shows the same calculation, but with the returns randomly reshuffled in time.

Reshuffling preserves the distribution of returns. Same average, same variance, same fat marginal tails. What it destroys is the order: the relationship between consecutive quarters.

Look at what happens. The blue line (real, actual time-ordered data) stays elevated. Even at 12-quarter aggregation, kurtosis is well above the “normal” baseline of 3. The green dashed line (reshuffled) drops to roughly 3 — Gaussian — by lag 8 or 10.

If property returns were truly fat-tailed but independent quarter-to-quarter, the blue and green lines would look the same. They don’t.

What this means: the multi-year fat-tailed risk in Singapore property is not primarily about how fat-tailed the quarterly distribution is. It’s about the fact that bad quarters cluster together, and good quarters cluster together. The market spends years at a time in a “boom regime” or a “bust regime.”

Visible in the data:

The 1980–81 boom was 6 consecutive quarters of double-digit gains.
The 1993 HDB boom was two consecutive quarters of +27% then +18%.
The 1996–2004 URA bust was a 10-quarter, ~45% peak-to-trough drawdown, after which the index stayed below the 1996 peak for over a decade.
The 2008 GFC dip was 4 consecutive negative quarters before recovery.

This is the difference between two mental models.

Model A, fat-tailed but independent: each quarter is an independent draw from a fat-tailed distribution; long-run drawdowns wash out via the central limit theorem; standard fat-tail stats describe the risk well; the 5-year worst case is moderate.

Model B, clustered booms and busts: quarters within a regime are correlated; regimes are persistent; long-run drawdowns compound because all the bad quarters land in the same window; you also need to model regime persistence; the 5-year worst case is severe (45%+ drawdowns observed).

The data say Model B is correct.

What this means for an actual buyer

If you buy property today expecting “the long-run average return is 6.4% per year, so over 5 years I should average 32% appreciation,” you’re using Model A. Model A says 5 years of bad luck is a freak event.

Model B says: the 5-year outcome is dominated by what regime you’re in, which is largely determined by when you bought. It’s not about luck. It’s about timing relative to the regime.

Practical consequences:

Late-regime entries are more dangerous than they look. Buying at the top of a sustained price run (high transaction volume, optimistic narratives, rising leverage in the system) raises the conditional probability that you’re at the start of a bust regime. Buying at the bottom of a sustained drawdown (low transaction volume, pessimistic narratives, distressed listings) does the opposite. Same property, same physical asset — wildly different conditional 5-year outcomes.
Holding period interacts with regime, not with calendar time. A 5-year holder who happens to enter the start of a boom regime exits at peak. A 5-year holder who enters at the start of a bust compounds losses. Same hold length, completely different experience.
Leverage that’s “safe in normal times” can ruin you in a long bust. A 70% LTV ratio is a manageable risk in a stable regime, even with the occasional bad quarter. In a 7-year bust regime where prices fall 40%, the same 70% LTV means negative equity for years. You can’t refinance, you can’t sell at non-distressed prices, and you have to keep servicing the loan throughout. The risk is not the worst single quarter; it’s the duration of the bust regime exceeding the buffer your finances can absorb.
Stress-test using historical events, not multiples of σ. When your bank or your spouse asks “what’s the worst case?”, don’t say “two standard deviations down.” Say “what if 1996–2004 happens again?” That’s a 45% drawdown over 7 years with the property essentially impossible to sell at non-distressed prices throughout. If your finances can survive that, you can buy. If they can’t, you can’t, regardless of what the average suggests.
Don’t outsource your regime-judgment to property gurus. Most property advice in Singapore is sold by people whose income depends on you transacting. They are structurally biased to believe whatever bullish or bearish narrative drives volume. Your actual cycle-positioning decision should rest on regime indicators (transaction volumes, price-to-rent ratios, mortgage-to-income ratios, supply pipelines, policy direction) — not on someone else’s confidence.
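The leverage point in the list above is simple arithmetic, sketched here under one loud assumption: no principal has been repaid when the bust hits (an interest-only simplification, so real outcomes are somewhat better). The function name is hypothetical.

```python
def equity_after_drop(ltv: float, price_drop: float) -> float:
    """Owner equity as a fraction of the purchase price after a price fall.

    Assumes the loan balance is still the original LTV share of the price,
    i.e. no principal has been repaid.
    """
    loan = ltv
    value = 1.0 - price_drop
    return value - loan

# 70% LTV, prices fall 40%: the loan now exceeds the property's value.
print(f"{equity_after_drop(0.70, 0.40):+.0%}")  # prints -10%
```

A 30% down payment is fully wiped out and the owner is underwater by a further 10% of the purchase price, before transaction costs.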

The bigger lesson

You can extend this beyond property. Almost every domain you care about is fat-tailed in this same way.

Equity markets have clustered crashes (1929, 1987, 2000, 2008, 2020).
Career outcomes are dominated by a few breakthrough years.
Most of your lifetime medical risk is concentrated in a few specific events.
A small fraction of relationships carry most of the long-term consequence.

The intuitive ways we think about averages and volatility — trained by exam questions about coin flips and dice — break down in these domains. The standard deviation underestimates the spread, the mean is dominated by outliers, and “averaging out over the long run” doesn’t happen because the long run is dominated by clustered regimes, not by independent draws.

Want to verify this yourself?

Here’s the entire analysis pipeline. Fewer than 100 lines of Python. The data is publicly available; no special access is required.


```python
import json
import urllib.request
import numpy as np
import pandas as pd
from scipy import stats

# 1. Fetch URA Private Residential PPI from SingStat (public API)
URL = "https://tablebuilder.singstat.gov.sg/api/table/tabledata/M212261?limit=5000"
data = json.loads(urllib.request.urlopen(URL).read())
row = next(r for r in data["Data"]["row"] if r["rowText"] == "Residential Properties")
ppi = pd.Series(
    [float(c["value"]) for c in row["columns"]],
    index=pd.PeriodIndex(
        [pd.Period(f"{c['key'].split()[0]}Q{c['key'].split()[1][0]}", freq="Q")
         for c in row["columns"]],
        freq="Q",
    ),
)

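# 2. Compute quarterly log-returns (sort first so the series runs oldest to newest)
ppi = ppi.sort_index()
r = np.log(ppi / ppi.shift(1)).dropna().values

# 3. Headline statistics (date range is read from the data, not hard-coded)
print(f"n = {len(r)} quarterly observations, {ppi.index[1]} to {ppi.index[-1]}")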
print(f"mean per quarter   = {r.mean():+.4f}")
print(f"Pearson kurtosis   = {stats.kurtosis(r, fisher=False):.2f}  (Gaussian = 3)")

# 4. Single-event dominance test
x4 = r ** 4
print(f"Max single-quarter share of total x^4: {x4.max() / x4.sum():.3f}")

# 5. Clustering test (Taleb's Fig 10.2)
def kurt_at_lag(returns, lag):
    n = len(returns) // lag
    aggregated = returns[:n*lag].reshape(n, lag).sum(axis=1)
    return stats.kurtosis(aggregated, fisher=False)

print(f"raw,        lag=12: {kurt_at_lag(r, 12):.2f}")
rng = np.random.default_rng(42)
shuffled = []
for _ in range(200):
    s = r.copy(); rng.shuffle(s)
    shuffled.append(kurt_at_lag(s, 12))
print(f"reshuffled, lag=12: {np.mean(shuffled):.2f}  (mean of 200 shuffles)")
```

When you run it, you'll see the raw lag-12 kurtosis stays well above 3, but the reshuffled version drops to about 2.7 — exactly the volatility-clustering signature this article describes.

Disclosure

Both this article and the Python analysis code were generated by Claude, Anthropic's AI assistant. The data is from SingStat (Singapore's Department of Statistics) and is publicly available. The methodology follows Chapter 10 of Nassim Nicholas Taleb's Statistical Consequences of Fat Tails (2020), particularly Figure 10.2 on the role of volatility clustering in apparent fat-tailedness.

I ran every test described, generated every chart from the actual data, and have full reproducibility. If anything looks wrong, please tell me — fat-tail analysis is famously easy to get subtly wrong, and a second pair of eyes is welcome.

The Hidden Interest Rate in Your Insurance Bill

KY John — Sun, 19 Apr 2026 02:15:45 GMT

You’re 64 years old, flipping through AIA’s health insurance brochure. The same basic plan shows four prices for the same coverage:

Annual: HKD 72,440
Semi-annual: HKD 36,944 × 2 = 73,888
Quarterly: HKD 20,280 × 4 = 81,120
Monthly: HKD 6,400 × 12 = 76,800

Two things jump out. First, monthly is only about 6% more than annual — tempting for the cash-flow relief. Second, quarterly is the most expensive of the four — more than monthly. Strange. Is it just a weird pricing quirk, or is the insurer telling you something?

The 6% on monthly isn’t a fee. It’s interest, and you’re paying it on money still sitting in your pocket. So the right question isn’t “how much extra?” but “what interest rate am I paying to delay?”

The exact answer is IRR. The useful one lives in your head.

Let m be the monthly payment and r the monthly rate. For the installment stream to be worth the same as paying P upfront today:

P = m + m/(1+r) + m/(1+r)² + ⋯ + m/(1+r)¹¹

Solving this exactly is what IRR does. Instead, use the approximation anyone with a calculus class already knows:

1 / (1+r)ⁿ ≈ 1 − nr (small r)

Plug it in and the messy geometric series collapses into a clean arithmetic one:

P ≈ m · [ 1 + (1 − r) + (1 − 2r) + ⋯ + (1 − 11r) ] = 12m − m·r·(1 + 2 + ⋯ + 11)

That last sum is 66. So:

P ≈ 12m − 66 m r = 12m · (1 − 5.5 r)

Here’s where the intuition crystallizes. That 5.5 isn’t arbitrary — it’s the average number of months you delay a payment. The first payment is on day zero, the last is at month 11; on average, each dollar is delayed 5.5 months.

Now flip it into years. Let R = 12r be the annual rate, and call d̄ = 5.5 / 12 ≈ 0.458 the average delay in years. Then:

P ≈ A · (1 − R · d̄), where A = 12m is the total you pay.

One rearrangement gives the rule:

R ≈ markup / d̄, with markup = (A − P) / A.

The implied rate is the markup divided by the average delay in years. Nothing more.

The cheat sheet

For N equal payments spaced evenly over a year, the average delay is the arithmetic-sum shortcut applied once and converted to years:

d̄ = (0 + 1 + ⋯ + (N−1)) / N × 1/N yr = (N − 1) / (2N) yr

Monthly (N = 12): d̄ = 11/24 ≈ 0.458 yr → multiplier ≈ 2.2× (markup ÷ 0.458)
Quarterly (N = 4): d̄ = 3/8 = 0.375 yr → multiplier ≈ 2.7× (markup ÷ 0.375)
Semi-annual (N = 2): d̄ = 1/4 = 0.25 yr → multiplier = 4× (markup ÷ 0.25)

Notice the pattern: fewer payments → shorter average delay → a given markup implies a higher rate. A 3% markup on a semi-annual plan is expensive in rate terms (~12%); the same 3% on a monthly plan is ~6.5%.

Back to the AIA plan

Apply the rule R ≈ markup / d̄:

Semi-annual: markup 2.0% × 4 = ~8% per year
Monthly: markup 5.7% × 2.2 = ~13% per year
Quarterly: markup 10.7% × 2.7 = ~29% per year

Suddenly the strange quarterly pricing makes sense. The insurer isn’t making a mistake — they’re quietly charging you roughly 30% per year to stretch payments over nine months. The monthly plan is a far better deal (and annual is better still if you have the cash).

Now you can actually choose. If your alternative is a credit card at 24%, the monthly plan’s ~13% is cheaper money — take it. If your cash is sitting in a HK deposit at 4%, pay annual and pocket the 9-point spread. And whatever you do, don’t pick quarterly unless you have no other option, because you’re borrowing at subprime rates for the privilege.

When it breaks

The approximation is tight when nr is small. For semi-annual and monthly above, the shortcut lands within about a point and a half of the exact IRR: 7.8% vs 8.3%, and 12.4% vs 13.8%. For quarterly, where the true rate is ~37%, the linearization starts to fail and our shortcut underestimates by about 8 points. The rule of thumb is: trust the shortcut up to ~20% annualized; beyond that, it still tells you the direction (“this is expensive”) but you’ll want a spreadsheet for the exact number.
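If you would rather check this in code than in a spreadsheet, here is a minimal sketch that puts the shortcut next to an exact IRR found by bisection. The function names are mine, and the exact rate is quoted as an effective annual rate (per-period rate compounded over the year), which is an assumption about the convention rather than something the brochure states.

```python
def shortcut_rate(upfront, payment, n):
    """Annual rate from the rule R ≈ markup / average delay in years."""
    total = payment * n
    markup = (total - upfront) / total
    avg_delay_years = (n - 1) / (2 * n)
    return markup / avg_delay_years

def exact_rate(upfront, payment, n):
    """Effective annual IRR of n equal payments, the first one due today."""
    def present_value(rate):
        return sum(payment / (1 + rate) ** k for k in range(n))

    lo, hi = 0.0, 1.0  # bracket for the per-period rate
    for _ in range(100):  # bisection: present value falls as the rate rises
        mid = (lo + hi) / 2
        if present_value(mid) > upfront:
            lo = mid
        else:
            hi = mid
    return (1 + (lo + hi) / 2) ** n - 1

ANNUAL = 72_440
plans = {"Semi-annual": (36_944, 2), "Quarterly": (20_280, 4), "Monthly": (6_400, 12)}
for name, (payment, n) in plans.items():
    print(f"{name:12s} shortcut {shortcut_rate(ANNUAL, payment, n):6.1%}   "
          f"exact {exact_rate(ANNUAL, payment, n):6.1%}")
```

Bisection is used only because it needs no libraries; any root finder would do.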

It also breaks when the schedule isn’t evenly spaced — a big deposit plus small monthlies, or a ballooning final payment. Same fix: spreadsheet.

But for the ordinary case of comparing payment frequencies on an insurance quote: divide the markup by the average delay in years, and you have your interest rate. It takes five seconds and it’s usually the difference between a good deal and a quietly terrible one.

When to Use a Blacklist, and When to Use a Rule

KY John — Sat, 04 Apr 2026 04:16:20 GMT

After joining the anti-fraud team, I noticed that blacklists are used far more extensively than they were in credit underwriting. That observation led me to reflect on the difference between a blacklist and a rule.

For readers outside the risk management field, a blacklist is a finite set of identifiers, such as user IDs, device fingerprints, phone numbers, or identity card numbers. Entities on the blacklist are not allowed to use the product. A rule, by contrast, is a decision function: it takes an input and produces an outcome, such as pass, reject, or review.

Mathematically, both can be expressed as indicator functions:

R(x) = 1[C(x)]

where C(x) is some condition on x, and

R(x) = 1[x ∈ B]

where B is a list.

In this sense, a blacklist is simply a special case of a general rule in which the condition is set membership. Mathematically, both are the same type of object: they map an input to a decision. In practice, however, the distinction remains useful. The term rule usually refers to non-list-like logic, while blacklist refers to list-like logic based on explicit membership. The real question, then, is not whether a blacklist is a rule, but when we should rely on list-like rules and when non-list-like rules are more appropriate.
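The same idea in code, as a minimal sketch: both kinds of rule share one function signature, and the blacklist is just the case where the condition is set membership. The field names, identifiers, and threshold below are hypothetical, not taken from any real system.

```python
from typing import Any, Callable, Mapping

Event = Mapping[str, Any]
Rule = Callable[[Event], bool]  # R(x): True means "flag this event"

def blacklist_rule(blocked: frozenset, field: str) -> Rule:
    """List-like rule: C(x) is membership of x[field] in a fixed set B."""
    return lambda event: event[field] in blocked

def threshold_rule(field: str, limit: float) -> Rule:
    """Non-list-like rule: C(x) depends on a current attribute of x."""
    return lambda event: float(event[field]) > limit

blocked_devices = frozenset({"dev-123"})      # hypothetical identifier
rules = [
    blacklist_rule(blocked_devices, "device_id"),
    threshold_rule("signups_last_hour", 5),   # hypothetical threshold
]

event = {"device_id": "dev-999", "signups_last_hour": 8}
print([rule(event) for rule in rules])  # [False, True]
```

The device is not on the list, so the blacklist passes it, while the behavioral rule flags it based on what it is doing right now.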

A blacklist is like a memory of bad actors. Once a bad actor is identified, the goal is to prevent them from exploiting the product again. But this only works under two conditions: first, we can identify them reliably; second, the identifier is not easy to replace or rotate. For that reason, blacklists are most suitable for high-confidence, confirmed cases tied to relatively durable identifiers. If our confidence is low, or if the identifier can be changed easily, a blacklist may do more harm than good by blocking legitimate users while doing little to stop the bad actor in the long run.

A non-list-like rule, by contrast, captures a more general pattern of risk. Someone rejected by the rule today may not be rejected tomorrow, because the decision depends on current attributes or behavior rather than fixed membership in a list. This makes rules more suitable when the signal is weaker, more probabilistic, or tied to identifiers that can be changed easily.

In practice, however, teams often blur the boundary between the two. Low-confidence signals or easily rotated identifiers are sometimes added to blacklists, which can create high false-positive rates. Conversely, even high-confidence bad actors are sometimes handled only through dynamic rules, leaving room for repeated breaches once the pattern changes or the rule is circumvented.

The key is to match the tool to the nature of the signal. A blacklist works best when the signal is strong and the identity is durable. A rule works better when the signal is less certain or when the adversary can easily change identifiers. Although the two are mathematically similar, they play different operational roles: a blacklist acts as memory, while a rule acts as generalized reasoning. Confusing the two can either block too many good users or allow known bad actors to return.

If You Cannot Create It, You Don't Understand It, Even with AI

KY John — Sun, 29 Mar 2026 04:00:14 GMT

I recently built a tiny autograd engine from scratch — just a hundred lines of Python that can compute gradients through a computation graph. The kind of thing an LLM could spit out in seconds. But I didn’t let it. I wrote every line myself, hit bugs, traced gradients by hand on paper, and asked the AI to check my work — not to do it for me. And honestly? It was the most satisfying learning experience I’ve had in a while.

Reflections on Learning with AI

LLMs can hinder learning if you let them: Learning is fundamentally hard because it requires working your brain. An LLM can provide solutions instantly, but that bypasses the struggle where real understanding is built. Think of the LLM as a teacher sitting beside you — the teacher cannot complete the task for the student. You still have to do the thinking.

LLMs can act as a teacher, but only if you drive the conversation: They can verify your understanding, answer questions, and generate practice problems. However, current LLMs are not proactive teachers — they respond, they don’t lead. It takes skill to prompt the LLM effectively, which is challenging, especially for younger students who may not yet be comfortable with self-directed learning.

LLMs are better at details than intuition: They tend to explain step-by-step solutions well but struggle to convey the big picture and the intuition behind the concepts. For example, Section 1 (The Big Picture) of these notes was my own mental model, polished by the LLM — not generated by it.

Writing code by hand is irreplaceable: Having the opportunity to implement something yourself is deeply satisfying and solidifies understanding in a way that reading or prompting never can. As the saying goes, “What I cannot create, I do not understand.” (Richard Feynman). The bugs you hit, the edge cases you miss, the moments where it finally clicks — that’s where the real learning happens.

Attached are the study notes on autograd and gradient descent, prepared together with an LLM:

Study Notes (PDF, 137KB)

Can AI Agents Win a Modeling Challenge? A Replicable Experiment

KY John — Wed, 25 Mar 2026 15:21:54 GMT

To get the most out of AI agents, we need to remove human bottlenecks and increase leverage.

If you want to jump straight to the implementation, the full code is here: GitHub.

What Was Really Being Tested

My company recently held a modeling competition in which participants were allowed to use only AI tools, without writing any code themselves. The objective was straightforward: maximize the AUC of a supervised learning task.

In tabular supervised learning, the practical toolkit is already quite mature. Gradient boosting models remain the dominant high-performance baseline, and in real-world applications the largest gains often come not from changing the model class, but from accumulating more data, engineering better features, and defining targets that better reflect business objectives. Model fine-tuning can help at the margin, but it rarely produces step-change improvements. Even in sequential modeling, much of the value can be understood as learning richer representations of the underlying behavior and structure.

For that reason, the most interesting question was not whether an LLM could write code to implement ideas that human practitioners had already provided. If humans specify the feature candidates, the modeling tricks, and the overall direction, then the exercise becomes largely a test of coding and execution. That is useful, but not especially interesting. A more meaningful test is whether an LLM can autonomously discover promising ideas for itself: what features to construct, which techniques are worth trying, how to interpret intermediate results, and how to iterate toward better performance under a clear objective.

Seen from this perspective, the purpose of the competition, in my opinion, was neither to discover a fundamentally new classification algorithm nor to implement solutions based on human insights. It was to test whether an LLM could independently reproduce the practical workflow behind strong credit risk modeling: generating hypotheses, engineering useful features, running experiments, learning from feedback, and improving performance through iteration. Put differently, if the playbook used by experienced practitioners is only partially specified, how much of it can an LLM recover on its own?

This also made the exercise a test of instruction design. Beyond model performance, it offered a way to understand how tasks should be framed so that an LLM can explore the solution space productively rather than simply execute a predefined recipe.

The Experiment Setup

At first glance, the objective may seem clear enough: give the LLM a target metric and ask it to iterate on its own. But things work more smoothly in human teams because people already share a large amount of tacit context: what counts as a valid experiment, what shortcuts are unacceptable, how performance should be evaluated, and when a result is worth keeping. For an LLM, many of these assumptions have to be made explicit.

To make the exercise meaningful, we needed to specify a set of operating rules that human teams would often leave implicit.

The experimental setup was not fully specified from the start; as I observed the agent’s behavior, I iteratively refined and steered it. For example, in the later stages, I found that the acceptance threshold had become too strict.

Scope. We had to define what the agent was allowed to modify, what it was not allowed to touch, and what kinds of actions were permitted or prohibited.

Integrity. The agent could use only data available before the application date. It also had to check for leakage before proceeding, especially when a single feature appeared to perform suspiciously well.

Evaluation. The evaluation methodology had to be fixed in advance. Otherwise, the LLM could improve reported scores simply by changing the validation setup rather than improving the model itself.

Promotion criteria. We needed to define what magnitude of improvement was sufficient for a new approach to be accepted and carried forward.

Logging. Like humans, LLMs do not naturally produce good documentation unless asked to do so. To make their work inspectable—and to give the agent a usable record of its own progress—we had to explicitly require logging.

Resource constraints. The agent needed rules for efficient experimentation: cache generated features for reuse, avoid recomputing unnecessarily, explore ideas in parallel where possible, and operate within limits on the number and scale of experiments.

Simplicity. Some human judgment still had to be encoded into the setup. In general, we preferred simpler models when they delivered performance comparable to more complex alternatives.

Stopping criteria. We also had to define when the agent should stop iterating, rather than continuing to search indefinitely for marginal gains.

These rules were not just administrative details. They were part of the experiment itself. If the goal was to test whether an LLM could behave like a disciplined modeler, then the environment had to specify the constraints under which disciplined modeling takes place.
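To make two of these rules concrete, here is a minimal sketch of how promotion and stopping criteria might be encoded. It is not the instruction file used in the experiment: the class and function names, the minimum gain of 0.0005 AUC, and the patience of 12 rounds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class SearchState:
    best_auc: float = 0.5          # AUC of the currently promoted approach
    rounds_without_gain: int = 0   # consecutive rounds with no promotion

def update(state: SearchState, candidate_auc: float,
           min_gain: float = 0.0005, patience: int = 12) -> bool:
    """Apply the promotion rule, then return True if the agent should stop."""
    if candidate_auc - state.best_auc >= min_gain:
        # Promotion: the gain is large enough to carry forward.
        state.best_auc = candidate_auc
        state.rounds_without_gain = 0
    else:
        state.rounds_without_gain += 1
    return state.rounds_without_gain >= patience

state = SearchState()
print(update(state, 0.79), state.best_auc)    # promoted, keep going
print(update(state, 0.7901), state.best_auc)  # gain too small, not promoted
```

Keeping both thresholds as explicit parameters makes it easy to loosen them later if they turn out to be too strict.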

You can find the full instruction to the LLM here.

How the LLM Iterated

The LLM took around 48 hours to reach the final result. In the first 24 hours, I did not prompt it at all, and it reached 0.793424 AUC at round 31. Then the LLM got stuck for 12 consecutive rounds. I prompted it to use web search to look for ideas; I had already stated this in the original prompt, but it preferred to rely on its internal knowledge.

Thoughts

With clearly defined objectives and boundaries, an LLM can iterate on its own and uncover most of the early low-hanging fruit much faster than a human can. However, more distant ideas—such as KNN-based approaches or sequential modeling, which require a more creative leap—seem harder for it to discover. That said, the LLM only spent two days exploring the solution space. Without internet access, it is not obvious that a human could have found those ideas within the same time frame either. My guess is that the top-performing team spent far more than two days to achieve its breakthrough, and likely involved multiple team members.

What stands out most is how much cheaper it becomes to try new ideas. The cost of experimentation drops sharply in terms of both time and human effort. With its coding ability, the LLM can translate modeling ideas into working code extremely quickly. Generating roughly 300 additional features, running hyperparameter tuning, stacking multiple models, and building sequential representations of the data—all within two days—would be expensive for a human team. To do the same amount of work, we might need five to ten experienced practitioners. In my case, I did not even fully use the token budget of Claude’s USD 100 max plan.

Will humans lose their jobs? Partly, yes. LLMs will increasingly replace a large share of the coding, pipeline construction, and some feature engineering work. But they still need humans to provide direction. The human role will shift toward defining the scope, constraints, and objectives of the task, and then validating the outputs. At least for now, humans also seem to have better taste—in the sense of searching the solution space more efficiently and recognizing which directions are truly promising. We sometimes hear stories of LLMs surfacing obscure old papers that solve a problem outright, which suggests that this advantage may not always hold. But on average, I still think humans with domain knowledge retain an edge.

Will that remain true as we use LLMs more and progressively transfer domain knowledge into them? I do not know. If forced to give a rough estimate, I would guess that within five to ten years, LLMs may surpass humans even in the discovery of previously unknown solutions.

What seems much clearer is that, over the next five years, human-plus-AI will dominate human-alone workflows. The productivity gain is easily an order of magnitude for many existing tasks, and in some cases effectively unbounded because AI enables work that would not have been attempted otherwise. If you are not using AI every day to learn new things, you are likely falling behind. If you are not using it to accelerate coding-related work, you are almost certainly moving much more slowly. And if you are not using it to help generate new ideas and solutions, you may already be at a meaningful disadvantage. Once these effects compound, the gap between human-only and human-plus-AI workflows becomes very large.

Next Step

If the current limitation is the LLM’s “taste,” then the next frontier is not just better models, but better ways of searching the solution space. The key challenge is to help the LLM explore more intelligently, so that it can identify promising directions earlier instead of relying mainly on broad trial and error. Early ideas include tree-of-thoughts and Monte Carlo Tree Search (MCTS), which I will try to implement next week.

I also need to better maximize the available token budget. During this project, the agent never came close to exhausting the token allowance of the Claude Max ($100/month) plan. In practice, a considerable amount of time was spent waiting for model runs to finish while the agent was otherwise idle, suggesting that I did not make full use of the available capacity.

Recommended Readings

How Singapore Savings Bonds Work: A Financial Engineering Perspective

KY John — Sun, 15 Mar 2026 09:07:39 GMT

Since crossing the 30s mark, my financial life has changed significantly, mainly due to increased liabilities. While human capital remains the primary cash-generating asset, as was common in the early 30s, it is important to build greater robustness into the system because all income depends on continuous production by the brain and body. Robustness can be achieved through downside volatility mitigation and diversification. Insurance and cash reserves can help cap the downside, and financial capital diversifies human capital risk to some extent. This article discusses building cash reserves and, in particular, the characteristics of the Singapore Savings Bonds (SSB).

Cash is “Still” the King

Many financial advisors say cash is trash because the government can “print” money and cash will depreciate relative to other assets, which makes it a poor asset to hold. However, that is not always the case. If equity goes down by 18%, the same amount of cash can now buy 22% more equity (1/(1-18%) - 1): cash gains in relative purchasing power even though its nominal value stays the same. In particular, during monetary tightening, bond prices fall while cash yields rise, making cash an attractive asset to hold. More generally, cash earns interest when treated as an investable asset rather than a medium of exchange, and its yield roughly follows the short end of the yield curve.

Singapore Savings Bonds

SSB is a 10-year government bond issued monthly by MAS, available to individuals in multiples of $500. What makes it unique is its structure: SSB is a single instrument with a guaranteed principal that replicates the SGS yield curve at every holding period — prioritising the step-up feature when the curve shape doesn’t naturally allow it. To my knowledge, this structure offers a few advantages:

Flexibility in holding period: Investors do not have to commit to a fixed holding period, unlike with fixed deposits and bonds. SSBs can be redeemed at par in any given month with no penalty.
No price risk: Unlike a bond, which trades on the secondary market and whose price can drop below par value, the SSB principal is guaranteed.
No reinvestment risk: The coupons are guaranteed and locked in at issuance. For 6-month or 1-year T-bills, when they mature, investors must reinvest at the then-available rate, which may be lower.

Effectively, SSB offers two embedded protections. The first is a free put option at par, allowing the investor to “sell” it back to the government at face value at any time — this protects when rates rise. The second is a free forward rate agreement on the full yield curve, locking in returns for up to 10 years — this protects when rates fall. The limitation, however, is the investment cap ($50,000 per issue and $200,000 per individual).

How SSB Rates Reflect the Yield Curve

To understand how SSB coupons are derived, it helps to look at the underlying SGS yield curve across different rate environments. The chart below shows three panels: the SGS benchmark yields (input), the resulting SSB coupon rates (output), and the average returns investors actually earn.

Panel 1 — SGS Yield Curves (the input)

The yield curve shape has changed dramatically across eras

2015-2017 (purple): Steep upward slope — short rates ~1%, long rates ~2.2%
2020-2021 (teal): Lowest across all tenors — near-zero short rates from COVID easing
2022-2024 (red): Inverted — short rates (~3.5%) are higher than long rates (~3%) from aggressive rate hikes
2025-2026 (orange): Re-steepening as short rates fall faster than long rates

Notably, all eras converge to a narrower range at 10Y (~1.4-3%) than at 1Y (~0.3-3.5%), reflecting the fact that long-term rate expectations are more stable than short-term policy rates.

Panel 2 — SSB Coupons (the output)

The step-up mechanism transforms the yield curve shape into a monotonically increasing schedule
Inverted curves (red) become flat coupons — the monotonicity adjustment compresses everything to the 10Y level
Steep curves (purple) produce dramatic step-ups — from ~1% to ~3.5%
The Y10 coupon is always higher than the Y10 yield because it compensates for the lower early coupons

Panel 3 — Average Returns (what investors actually earn)

This is the smoothed version of the coupon schedule — always monotonically increasing and more gradual
At 10Y, it matches the SGS yield in Panel 1 — confirming the design works
At shorter tenors, it matches SGS yields only when no monotonicity adjustment is needed

Key takeaway

The three panels tell the story of a single financial engineering pipeline: a market yield curve (which can be any shape — steep, flat, inverted) gets transformed into an investor-friendly product that always steps up, always returns principal, and always delivers the 10Y market return. The price of this guarantee is visible in the red era — when the curve inverted, short-term SSB holders earned well below the prevailing T-bill rate because the step-up constraint sacrificed short-tenor returns to preserve the monotonicity feature.

Methodology

For those interested in the methodology for deriving the SSB coupon from SGS benchmark rates, you may refer to this MAS paper. Here is a high-level skeleton:

Step 1 — Get Benchmark Yields

Download daily SGS benchmark yields from MAS Benchmark Prices and Yields
Specifically: 1-Year T-Bill Yield, 2-Year Bond Yield, 5-Year Bond Yield, 10-Year Bond Yield
Compute the simple average of all trading days from month M-2 (two months before the SSB issue month)

Step 2 — Interpolate the Full Yield Curve

Use a Hermite spline to fill in the missing tenors (3Y, 4Y, 6Y, 7Y, 8Y, 9Y)
Now you have 10 par yields: Y1, Y2, ..., Y10

Step 3 — Bootstrap Discount Factors

Convert par yields into discount factors (the present value of $1 received at each future year)
DF1 = 1/(1+Y1), then each subsequent DF is solved from the no-arbitrage pricing equation

Step 4 — Solve for Step-Up Coupons

Find coupon rates C1, C2, ..., C10 such that the bond prices are at par for every holding period
This means: an investor who holds for N years and redeems at par earns the same return as an N-year SGS bond
C1 = Y1, then each subsequent coupon is solved forward using the discount factors
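
Steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, assuming annual coupons and exact par pricing. The function names and the sample yields are hypothetical, and the sketch skips the interpolation, monotonicity adjustment, and rounding from the other steps, so it will not reproduce published MAS rates.

```python
def bootstrap_discount_factors(par_yields):
    """Par yields Y1..Yn (as decimals) -> discount factors DF1..DFn.

    Uses the par-bond condition 1 = Yn * (DF1 + ... + DFn) + DFn,
    which assumes one coupon payment per year.
    """
    dfs = []
    for y in par_yields:
        annuity = sum(dfs)  # DF1 + ... + DF(n-1), already solved
        dfs.append((1.0 - y * annuity) / (1.0 + y))
    return dfs


def step_up_coupons(dfs):
    """Solve C1..Cn so the bond prices at par for every holding period."""
    coupons = []
    pv_coupons = 0.0  # present value of the coupons solved so far
    for df in dfs:
        # Par condition for an N-year hold: pv_coupons + C_N*DF_N + DF_N = 1
        c = (1.0 - df - pv_coupons) / df
        coupons.append(c)
        pv_coupons += c * df
    return coupons


# Hypothetical upward-sloping par curve, for illustration only.
yields = [0.010, 0.015, 0.020]
coupons = step_up_coupons(bootstrap_discount_factors(yields))
print([round(c, 4) for c in coupons])
```

With a flat curve every coupon equals the flat yield. With an upward-sloping curve the later coupons come out above the par yield of the same tenor, which is the step-up effect described in Panel 2.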

Step 5 — Enforce Monotonicity

If the raw coupons don’t step up (e.g. inverted yield curve), adjust them:
- Minimize the pricing error across all tenors
- Subject to: coupons never decrease, 10Y return is preserved exactly
Round to 2 decimal places

Data Sources

SGS benchmark yields: MAS Benchmark Prices and Yields
Published SSB rates: MAS Step-Up Interest Rates (127 bonds, Oct 2015 – Apr 2026)
Technical specification: SSB Technical Specifications, Section 4

Fraud Management and the Pricing of Tail Risk

KY John — Sun, 08 Mar 2026 00:59:43 GMT

As a new fraud manager, I started reflecting on the fundamental principles of fraud management. I began by looking into the metrics commonly used by fraud teams, such as precision, FPR, and recall. However, I soon realized how difficult it is to set KPIs in fraud because of a core paradox: absence of evidence is not evidence of absence.

In this blog post, I want to go one step further and discuss the causes of this paradox, and why I believe it is extremely dangerous to optimize fraud management primarily around precision, FPR, and recall.

Fraud events often have the following characteristics:

The frequency of fraud is small relative to the total number of events, at least while the business is still surviving.
The loss from fraud is often many orders of magnitude larger than other types of loss, such as credit loss per event. A single fraud incident can wipe out a month’s profit.
The real damage usually comes from previously unknown attack vectors.
Fraud events happen more often than we think. When we say “low frequency,” we may imagine once every few years. In reality, they may happen much more often.

The consequences are as follows.

First Order Consequences

A 99.99% recall may still not be enough, because the remaining 0.01% of uncaptured fraud can still put the business at risk.
The absence of observed fraud events can create a false sense of security, even though only one disconfirming event is needed to reject the proposition that we are safe.
Thinking only in terms of frequency can put us in grave danger. First, we may assume fraud events are like 4-sigma events—rare and exceptional—when in fact they happen far more often, perhaps even every day. Second, even a truly rare event can still hurt the business badly because of the scale of exposure.
Fraud events cannot be treated in isolation, because the business may never fully recover from a large incident. There is path dependency.

Second Order Consequences

Focusing too much on precision, FPR, and recall is dangerous. These metrics can miss the small fraction of unknown risks that may cause losses 1000 times larger than what we are prepared for.
Similarly, machine learning models are not sufficient for dealing with extreme unknowns. Over-focusing on deploying more advanced models can shift attention away from what truly matters.
Average risk can be highly misleading. It changes dramatically once an outlier occurs, and those outliers are often both more frequent and much larger than people expect. Monitoring average risk alone can create a false sense of security.
Frequency-based caps, such as the number of transactions allowed, are conceptually incomplete because a single event can still wipe you out.
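
The point about average risk can be made concrete with a toy calculation. The figures below are hypothetical, chosen only to show how one tail event changes the picture; they are not drawn from any real portfolio.

```python
from statistics import mean, median

# Hypothetical portfolio: 999 routine losses of 100 each.
routine_losses = [100.0] * 999
# One hypothetical tail event, four orders of magnitude larger.
tail_loss = 1_000_000.0

mean_before = mean(routine_losses)
with_tail = routine_losses + [tail_loss]
mean_after = mean(with_tail)
median_after = median(with_tail)
tail_share = tail_loss / sum(with_tail)  # share of total loss from one event

print(mean_before, mean_after, median_after, round(tail_share, 3))
```

The average looks calm until the outlier arrives, and the median barely registers it at all. In this sketch a single event accounts for most of the total loss, which is why a per-event exposure limit matters more than the average.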

Strategy

Are we doomed then? Not necessarily.

Although it is nearly impossible to predict or control exactly when fraud will happen, exposure can still be controlled. I would argue that the primary responsibility of a fraud team is to cap the downside of the business. In the process, the team should also aim to reduce the premium paid for that protection—for example, by lowering false positives. But uncapped exposure is non-negotiable.

A cap may look too simple, but we are not here to impress our peers.

Finally, fraud risk practitioners deserve more respect. In many ways, the fraud team is effectively the seller of a put option, while the business is the buyer. The fraud team’s upside is capped. The best outcome it can possibly deliver is close to 100% recall, which is impossible to achieve in practice. Yet its downside is theoretically uncapped. When an extreme event happens, the fraud team is often the one blamed.

At the same time, the premium paid for this protection—not in salary, but in lost business opportunities—is often mispriced relative to the guarantee the fraud team is expected to provide. The business tends to focus on frequency: fraud is unlikely, so why sacrifice business volume to manage a “long-tail” event? Fraud teams, by contrast, have to focus on magnitude.

AI in My Daily Workflow: Use Cases and Reflections

KY John — Fri, 20 Feb 2026 06:17:33 GMT

Use Cases

Understand a codebase quickly. DeepWiki provides a high-level, human-readable overview of any codebase — far more efficient than reading raw code. You can also ask questions directly about the project structure and logic.

Automate workflows and solve problems computationally. Claude Code is my go-to here.

Recently, I used Claude Code (with Agent Teams) to generate personal financial statements from bank, broker, and credit card statements. The result was surprisingly good. I spent about one working day polishing the output — something that would have taken me weeks to build from scratch.
Previously, I built a property listing monitor that pulls new listings from targeted developments and sends updates via Telegram. (GitHub)

Q&A — information retrieval and brainstorming. I use Gemini across the board: text, images, and video (directly on YouTube). Frontier models are converging in capability, and Gemini offers better value for money. The bundled 2TB Google Drive is a bonus I always end up using anyway.

Understand a new subject quickly. My workflow:

Find resources via Google Scholar Labs, Gemini Deep Research, or Zhihu AI Chat (知乎直答).
Upload selected resources to NotebookLM and start asking questions.

I recently used this to get up to speed on group fraud detection as part of a push to take on broader antifraud responsibilities — what might have taken weeks of reading compressed into a few focused sessions.

Build a personal knowledge base. I store personal notes in Markdown, use Claude Code to build a RAG server on top of them, and pipe in RSS feeds converted to Markdown. Everything becomes searchable and queryable.

Reflections

AI has fundamentally changed how I learn. I spend far less time on Google Search — only when I need to trace a primary source. The structured summaries AI produces help me grasp the shape of a subject much faster than stitching together search results ever did.

AI is already very capable at coding. More importantly, coding is just a means to an end — a method for solving problems. AI lowers the barrier between having a problem and implementing a computational solution. If something can be solved programmatically, AI dramatically accelerates getting there.

One thing I’ve noticed with Claude Code specifically: memory and context have improved significantly. Previously, every new session started from scratch with no accumulated context. Now, the agent can automatically build a memory file or be explicitly prompted to maintain one across sessions. It feels more like working with a collaborator than restarting a tool.

That said, QA and critical thinking remain distinctly human strengths. AI output still needs to be sense-checked. Common sense matters — the ability to notice when something doesn’t add up, when a result feels off, when the logic is internally consistent but wrong in practice. Judgment about what’s right is still something humans do better.

Looking further ahead, once AI reliably solves coding, it gains the ability to improve itself. At that point, the barrier to completing any cognitive task effectively disappears. The human role shifts toward defining objectives — what we want the world to look like, what problems are worth solving, and what the company should be doing. AI handles the execution.

But there’s a second, subtler role: QA at the objective level. We’ll need to examine AI-proposed solutions and verify they actually align with what we intended — until AI develops its own objectives, and becomes sophisticated enough that the misalignment isn’t obvious. That’s the part worth thinking carefully about now.

The Risk Metric Translation Layer: Why Precision and FPR Aren't Mirror Images

KY John — Sat, 07 Feb 2026 08:10:00 GMT

I sometimes find it hard to communicate the antifraud team's performance metrics, such as precision and false positive rate (FPR), to other teams, or even within the antifraud team, because those words are also used loosely in everyday conversation. I once said, “Our policy precision is 90%.” The counterpart replied, “A 10% false positive rate is too high to accept.” Clearly, there is a gap in how the two sides understand precision and FPR.

On the business side, when they say “10% FPR is too high,” they mean: out of all the good users, how many are falsely labelled as bad? They worry about user experience. On the risk side, when we say “precision is 90%,” we mean: out of all the users labelled as bad, how many are actually bad? We are concerned about the performance of the models and rules themselves. By deriving the relationship between the two metrics, we can bridge the gap.

Let’s revisit the two-room analogy for intuition. FPR only cares about the good users’ room (N). If FPR increases, the absolute number of false positives increases, and precision [TP/(TP+FP)] drops because the denominator becomes larger. But let’s hold precision constant, since we want the relationship between the two. This forces the model to capture more TP. In other words, recall (TP/P) must increase. Finally, since FPR and recall are ratios within each room, the relative sizes of N and P, a.k.a. prevalence [P/(P+N)], also affect the relationship. To summarize, precision and FPR are connected by recall and prevalence. 90% precision does not imply an FPR of 10%, because we must also account for recall and the natural fraud rate (prevalence).

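Mathematically, we can derive the relationship as:

$$\text{FPR} = \text{Recall} \times \frac{1 - \text{Precision}}{\text{Precision}} \times \frac{\theta}{1 - \theta}$$

where theta is the prevalence, P/(P+N). The derivation uses FP = TP × (1 − Precision)/Precision, TP = Recall × P, and P/N = θ/(1 − θ).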

Consider the following scenarios to consolidate the intuition, assuming the model is reasonably good (both precision and recall are high):

When prevalence is low, FPR is low while precision stays high.
When prevalence is high, both FPR and precision are high.

You may already notice that even if model performance is poor, FPR can be low if prevalence is very low. For example, if the prevalence is 0.1%, the recall is 10%, and the precision is 20%, the FPR is only 0.04%.
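
The link between the four quantities follows from the definitions: FP = TP × (1 − precision)/precision, TP = recall × P, and FPR = FP/N. A minimal sketch that reproduces the weak-model example is below; the function name is hypothetical, and all inputs are fractions rather than percentages.

```python
def fpr_from_precision(precision, recall, prevalence):
    """FPR implied by precision, recall, and prevalence (all as fractions)."""
    odds_of_bad = prevalence / (1.0 - prevalence)  # P / N
    fp_per_tp = (1.0 - precision) / precision      # FP / TP
    return recall * fp_per_tp * odds_of_bad


# Weak-model example: prevalence 0.1%, recall 10%, precision 20%.
weak_model_fpr = fpr_from_precision(precision=0.20, recall=0.10, prevalence=0.001)
print(f"{weak_model_fpr:.4%}")  # about 0.04%
```

Holding precision and recall fixed, raising prevalence raises the implied FPR, which is why a "90% precision" figure alone says little about how many good users are affected.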

Finally, the question is which number we should use. I think both should be reported separately, as the audiences are different. For the risk team, we should focus on precision and recall, as they are directly linked to model/policy performance. However, we should also check FPR as it affects the user experience.

Absence of Evidence ≠ Evidence of Absence: Rethinking Fraud Prevention KPIs

KY John — Sat, 24 Jan 2026 02:00:54 GMT

The Core Paradox

The Fundamental Challenge: Low incident rates in fraud prevention could indicate either:

Strong defenses - The system is working effectively
Lack of attempts - There simply haven’t been many fraud attempts

This ambiguity makes it difficult to assess the true effectiveness of fraud prevention measures.

The Unknown Unknowns Problem

Beyond measuring what we know, there’s a deeper challenge: you cannot directly measure what you are unaware of. This is the classic “unknown unknowns” dilemma in security and fraud prevention.

Bayesian Framework

To address these challenges, we can apply Bayesian reasoning:

Interpreting the Evidence

When the likelihood is high: P(No Incident∣Fraud) is high
- The absence of fraud incidents strongly indicates effective fraud prevention
When the likelihood is low: P(No Incident∣Fraud) is low
- The absence of incidents provides little information about fraud prevention effectiveness

Measuring the Unmeasurable

The Challenge

Evaluating the likelihoods requires:

Understanding the threat landscape
Assessing coverage of known fraud methods
Accounting for unknown fraud methods (which cannot be directly measured)

Indirect Measurement Methods

Approaches to Discover Unknown Unknowns

When direct measurement is impossible, we can use these indirect methods:

Red Teaming - Simulate adversarial attacks to find vulnerabilities
External Threat Intelligence - Learn from industry patterns and breaches
Post-Mortem Analysis - Learn from failures when they occur
First Principles Decomposition - Break down attack surfaces systematically

Practical Framework

Measuring Fraud Prevention Success: To measure the success of fraud prevention in low incident scenarios, focus on:

Coverage Metrics
1. Percentage of known fraud methods with active defenses
2. Depth and breadth of detection capabilities (Cost of attack)
Discovery Rate
1. Incremental discovery of new fraud methods over time
2. Rate of vulnerability identification and remediation
Process Interception Metrics
1. Trigger rate: Proportion of applications/transactions flagged by rules
2. Hit rate: Proportion of triggered cases with confirmed abnormal behaviors (grey areas go to manual review)

This approach helps infer effectiveness even when direct incident data is sparse.

Can You Solve This? Calculating KS from a Quarter-Circle ROC Curve

KY John — Sun, 11 Jan 2026 11:06:47 GMT

I recently came across an interesting interview question that bridges geometry and risk modeling: “If a model’s ROC curve is a perfect concave down quarter-circle, what is its KS value?”

In my last post, I discussed the relationship between AUC and KS. Let’s build on that foundation to tackle this new question. Let’s recap the key concepts:

ROC Curve: A plot of the true positive rate against the false positive rate. A good model’s ROC curve bows towards the top-left corner because it maximizes true positives while minimizing false positives.
KS Statistic: The maximum vertical distance between the cumulative distribution functions (CDFs) of the positive and negative classes. It quantifies how well the model separates the two classes.

The key insight to solving this problem lies in the fact that the true positive rate is the same as the cumulative distribution function (CDF) for the positive class, and the false positive rate is the CDF for the negative class. Therefore, the KS statistic can be expressed as: KS=max(TPR−FPR). (Please refer to my previous post for a detailed explanation of this relationship.)

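The equation for a circle centered at (h,k) with radius r is: (x-h)^2 + (y-k)^2 = r^2. For a quarter-circle ROC curve centered at (1,0) with radius 1, the equation simplifies to: (x-1)^2 + y^2 = 1. Rearranging, and taking the positive root because TPR is non-negative, gives us:

$$y = \sqrt{1 - (x-1)^2} = \sqrt{2x - x^2}$$

To find the KS value, we need to maximize y − x over the interval [0,1]. This leads us to the function:

$$f(x) = \sqrt{2x - x^2} - x$$

Taking the derivative and setting it to zero gives us the critical points:

$$f'(x) = \frac{1 - x}{\sqrt{2x - x^2}} - 1 = 0 \quad\Longrightarrow\quad 2x^2 - 4x + 1 = 0$$

Solving this equation, we find:

$$x = 1 - \frac{\sqrt{2}}{2} \approx 0.293$$

(The other solution, x = 1 + √2/2, is discarded because the range of x is [0,1].)

Plugging this back into our function, we calculate:

$$KS = \frac{\sqrt{2}}{2} - \left(1 - \frac{\sqrt{2}}{2}\right) = \sqrt{2} - 1 \approx 0.414$$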

If you can follow this reasoning and arrive at the same conclusion, congratulations! You’ve successfully connected geometric intuition with statistical modeling. The key learning of the question is that KS alone does not fully capture model performance. It only anchors the shape of the ROC curve.
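
For readers who prefer to check the algebra numerically, here is a small sketch that scans the quarter-circle ROC curve on a grid and reports the largest gap between TPR and FPR. The function names and the grid size are arbitrary choices for illustration. The closed-form answer it should agree with is KS = √2 − 1, attained at FPR = 1 − √2/2.

```python
import math


def quarter_circle_roc(fpr):
    """TPR on the quarter circle centered at (1, 0) with radius 1."""
    return math.sqrt(1.0 - (fpr - 1.0) ** 2)


def ks_on_grid(n=100_000):
    """Brute-force max of TPR - FPR over an evenly spaced grid on [0, 1]."""
    best_gap, best_fpr = 0.0, 0.0
    for i in range(n + 1):
        x = i / n
        gap = quarter_circle_roc(x) - x
        if gap > best_gap:
            best_gap, best_fpr = gap, x
    return best_gap, best_fpr


ks, fpr_at_ks = ks_on_grid()
print(round(ks, 4), round(fpr_at_ks, 4))
print(round(math.sqrt(2) - 1, 4), round(1 - math.sqrt(2) / 2, 4))
```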

Same Math, Different Enemies: The Unit Economics of Credit and Fraud Risk

KY John — Sun, 28 Dec 2025 02:00:21 GMT

In my previous post, I discussed standard risk metrics like AUC, KS, Precision, and Recall. But metrics are just a means to an end. How do the actual decision-making processes differ between Credit Risk and Anti-Fraud?

Having recently rotated from the Credit Risk team to the Anti-Fraud team, I’ve realized that while the terminology differs, the fundamental logic—derived from the first principle of Maximizing Profit—is remarkably similar. Below, I demonstrate the mathematical unity of the two domains and highlight the key strategic differences.

The Shared Math: Maximizing Incremental Value

Let’s start with the basics. Total profit equals the revenue from good users minus the losses from bad users.

True Negatives (TN): Approved users who pay (Profit).
False Negatives (FN): Approved users who default or commit fraud (Loss).

Define positive as “we block/reject because predicted bad,” and negative as “we approve.”

We can express this as:

$$\text{Total Profit} = TN \cdot R - FN \cdot L$$

where R is Revenue per Good User and L is Loss per Bad User.

However, this equation is too static for strategy design. To decide whether to approve or reject a specific segment, we need to look at the Counterfactual: What is the value added by our decision model compared to doing nothing?

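If we compare our strategy against a baseline of “Approving Everyone,” and substitute TN = N_good − FP and FN = N_bad − TP, the Total Profit can be rewritten as:

$$\text{Total Profit} = \underbrace{(N_{good} \cdot R - N_{bad} \cdot L)}_{\text{Baseline Profit}} + \underbrace{(TP \cdot L - FP \cdot R)}_{\text{Incremental Strategy Value}}$$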

N_good,N_bad: Total population of good and bad users (Constant).
TP⋅L: The loss saved by correctly blocking bad users.
FP⋅R: The revenue sacrificed by incorrectly blocking good users.

Since the Baseline Profit is constant (determined by the population), maximizing Total Profit is mathematically identical to maximizing Incremental Strategy Value.
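
The decomposition is easy to sanity-check with numbers. The sketch below uses hypothetical counts and unit values (none of them come from a real portfolio) and confirms that profit computed from TN and FN equals baseline profit plus incremental strategy value.

```python
def total_profit(tn, fn, revenue_per_good, loss_per_bad):
    """Profit from approved users: good ones pay, bad ones cost us."""
    return tn * revenue_per_good - fn * loss_per_bad


def decomposed_profit(n_good, n_bad, tp, fp, revenue_per_good, loss_per_bad):
    """Return (baseline profit, incremental strategy value)."""
    baseline = n_good * revenue_per_good - n_bad * loss_per_bad  # approve everyone
    incremental = tp * loss_per_bad - fp * revenue_per_good      # value of blocking
    return baseline, incremental


# Hypothetical population and strategy outcome.
n_good, n_bad = 9_000, 1_000
tp, fp = 600, 300          # blocked bad users, blocked good users
R, L = 10.0, 100.0         # revenue per good user, loss per bad user

direct = total_profit(tn=n_good - fp, fn=n_bad - tp,
                      revenue_per_good=R, loss_per_bad=L)
baseline, incremental = decomposed_profit(n_good, n_bad, tp, fp, R, L)
print(direct, baseline, incremental)
```

In this made-up example, approving everyone loses money, and the strategy's incremental value is what turns the total positive.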

The Optimality Condition: MC = MB

Economics teaches us that profit is maximized when Marginal Cost equals Marginal Benefit.

Let p̂ be the probability that the next user we assess is “Bad.”

Marginal Benefit: If we block them and they are Bad, we save L.
Marginal Cost: If we block them and they are Good, we lose R.

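We should stop blocking exactly when the cost outweighs the benefit, so we keep blocking only while:

$$\hat{p} \cdot L \ge (1 - \hat{p}) \cdot R \quad\Longleftrightarrow\quad \hat{p} \ge \frac{R}{L + R}$$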

This simple ratio, R/(L+R), is the “Universal Cutoff.” But the two teams view it from opposite ends.
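
As a quick illustration of the cutoff, the sketch below compares the expected saving from blocking with the expected revenue sacrificed. The function names and the values of R and L are hypothetical.

```python
def breakeven_cutoff(revenue_per_good, loss_per_bad):
    """Probability of 'bad' at which blocking and approving break even."""
    return revenue_per_good / (loss_per_bad + revenue_per_good)


def should_block(p_bad, revenue_per_good, loss_per_bad):
    """Block when the expected loss saved is at least the expected revenue lost."""
    expected_benefit = p_bad * loss_per_bad
    expected_cost = (1.0 - p_bad) * revenue_per_good
    return expected_benefit >= expected_cost


R, L = 10.0, 100.0  # hypothetical revenue per good user and loss per bad user
cutoff = breakeven_cutoff(R, L)
print(round(cutoff, 4))
print(should_block(0.20, R, L), should_block(0.05, R, L))
```

The larger the loss is relative to the revenue, the lower the cutoff, so a team facing large principal losses can justify acting on fairly low probabilities.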

1. The Anti-Fraud View (The Defender)

In fraud, we “block” transactions.

Revenue (R): The cost of insulting a good customer (C_insult), which is the interest loss.
Loss (L): The fraud loss saved (C_fraud), which is the principal loss.

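The probability p̂ here represents the probability of fraud, which is exactly the definition of Precision for that marginal alert. Substituting into the Universal Cutoff gives the minimum precision worth acting on:

$$\text{Precision}_{min} = \frac{C_{insult}}{C_{fraud} + C_{insult}}$$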

If the model’s precision drops below this threshold, the cost of insulting good customers exceeds the value of the fraud we stop.

A Note on Terminology: Why does Probability (p̂) equal Precision?

It might seem confusing to equate a single user’s probability score (p̂) with a group statistic like Precision. However, at the margin, they are identical. Simply apply the Law of Large Numbers:

Imagine our model assigns a risk score of 0.20 to a specific transaction. Assuming the model is well-calibrated, this prediction literally means: “There is a 20% chance this specific transaction is fraud.”

Now, imagine we gather 100 transactions that all share this exact risk score of 0.20.

Expected Fraud (TP): 20
Expected Legitimate (FP): 80

If we block this specific bucket of users, the Precision is:

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{20}{20 + 80} = 20\%$$

Thus, the Probability Score at the cutoff is simply the Marginal Precision of the decision at that cutoff.

2. The Credit Risk View (The Attacker)

In credit, we “approve” loans.

Revenue (R): Interest Income.
Loss (L): Principal Loss.

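The probability p̂ here represents the Probability of Default (PD). Applying the same cutoff gives the maximum PD worth approving:

$$PD_{max} = \frac{R}{L + R} = \frac{\text{Interest Income}}{\text{Principal Loss} + \text{Interest Income}}$$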

If the user’s PD rises above this threshold, the expected principal loss exceeds the potential interest income. (Of course, this is a simplified way to calculate revenue and cost but it captures the essence.)

Conclusion: Same Line, Opposite Directions

Mathematically, Minimum Precision and Maximum PD are the exact same number.

The Anti-Fraud team defends the gate, blocking bad actors until the precision drops to the cutoff.
The Credit Risk team expands the market, approving users until the risk rises to the cutoff.

The Strategic Divergence: Different Enemies

If the math is identical, why do the jobs feel so different? Because while the equation is the same, the enemy is not.

1. Definition of “Bad”

Credit Risk: “Bad” is defined solely by default.
Anti-Fraud: “Bad” is defined by intent (Deception). Thus, default alone is not sufficient. The team will also look for suspicious patterns.

2. Static vs. Dynamic (Game Theory)

This is the most critical difference.

Credit Risk mostly plays against Nature. Borrowers are relatively stable. A user with a 620 credit score today behaves similarly to a 620 user yesterday. Historical “Vintage” data is highly predictive of the future.
Anti-Fraud plays against an Adversary. Fraudsters are intelligent, coordinated, and reactive. If you set a static rule to block X, they immediately shift to Y.
- Implication: Credit teams optimize for Efficiency (Calibration). Fraud teams must optimize for Adaptability (Exploration).

3. The Action Space

Anti-Fraud: The decision is usually binary (Approve vs. Reject) or friction-based (Step-up Verification).
Credit Risk: The decision is multi-dimensional. We can manage risk not just by rejecting, but by adjusting the Credit Limit, Tenure, or Pricing (EIR). We have more levers to force the unit economics to work.

4. The Entity Dimension (Multi-Modal Risk)

Credit Risk is almost exclusively User-Centric. We underwrite the person (or the business entity) applying for funds. The unit of analysis is stable.
Anti-Fraud is Multi-Modal. We don’t just assess the User; we assess the Device, the IP address, the Credit Card, and the Merchant.

The Economic Implication: The variables in our profit equation (C_insult and C_fraud) shift drastically depending on what we are blocking.

Blocking a Device: If I block a suspicious device ID, the C_insult might be low (the user is annoyed but can switch devices). Result: I can afford a lower precision threshold.
Blocking a Merchant: If I block a merchant in a marketplace, I cut off revenue from all their customers. The C_insult (Lost Revenue) is massive. Result: I need an extremely high precision threshold, often requiring manual review, before pulling the trigger.

While Credit Risk optimizes one curve (User Risk), Anti-Fraud constantly juggles multiple curves with different breakeven points.

Closing Thoughts

I am still very much a learner in this space, but realizing that Credit Risk and Anti-Fraud share the same Unit Economics has been a helpful anchor for me.

It means that while I am learning new tactics (Game Theory, Pattern Recognition), the underlying grammar of Profit Maximization remains the same.

AUC, KS, Precision, and Recall: A Risk Analyst’s Guide

KY John — Sat, 20 Dec 2025 02:00:41 GMT

As analysts in the risk management industry, we live and breathe acronyms like AUC (Area Under the Curve) and KS (Kolmogorov-Smirnov). These are the gold standards for evaluating classification models in credit risk. However, if you, like me, have recently rotated into an Anti-Fraud team from a credit risk team, you’ve likely encountered two different metric kings: Precision and Recall.

How do these metrics relate to each other?

The “Two Rooms” Analogy

The standard confusion matrix terms—True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN)—can be headache-inducing to communicate in practice.

To simplify this, imagine two separate rooms:

The Bad Room: Contains only actual bad actors (Total Bad = TP + FN).
The Good Room: Contains only actual good users (Total Good = TN + FP).

Your model is a gatekeeper. You walk into each room with your classifier.

In the Bad Room, you want the model to identify everyone as bad. The percentage of people you successfully catch here is your True Positive Rate (TPR).
In the Good Room, you want the model to identify no one as bad. The percentage of people you mistakenly flag as bad here is your False Positive Rate (FPR).

A perfect model has a TPR of 100% (catches everyone in the Bad Room) and an FPR of 0% (flags no one in the Good Room).

Visualizing AUC and KS

Realistically, models output a probability (e.g., “There is a 90% chance this user is a fraud”). We use a “threshold dial” to decide who to flag.

Imagine turning this dial:

Threshold 100% (Strict): You only flag apparent fraud. You catch almost no one in the Bad Room (TPR≈0), but you also annoy no one in the Good Room (FPR≈0). This is the point (0,0) on the ROC plot.
Threshold 0% (Loose): You flag everyone. You catch every fraudster (TPR=100%), but you also falsely flag every good user (FPR=100%). This is the point (1,1).

AUC (Area Under the Curve) summarizes the model’s performance across all possible settings of this dial. The ROC curve plots TPR against FPR, and AUC is the area under it. A random guess gives you a straight diagonal line (AUC 0.5), meaning for every extra bad person you catch, you annoy a proportional number of good people. A good model bows upward, maximizing the gap between the TPR and FPR.

KS (Kolmogorov-Smirnov) focuses on the single best point on that curve. It is simply the maximum difference between TPR and FPR. While AUC looks at the whole story, KS asks: “At the single best setting of the dial, how much separation can we get between the good and bad populations?”

Deep Dive: Why KS is Cumulative

KS is usually plotted using the cumulative number of bads and goods. The maximum difference between the two curves is the KS.

Imagine taking all the people from both rooms and lining them up in a single queue, sorted by their model score from Highest (Most Suspicious) to Lowest (Least Suspicious).

Now, imagine walking down this line from the start. This is equivalent to lowering your threshold.

Cumulative Bad (TPR): Every time you walk past a Bad Person, you add them to your count. If there are 100 Bad People total, and you have passed 50 of them, your “Cumulative Bad %” is 50%. This is exactly the True Positive Rate (TPR).
Cumulative Good (FPR): Every time you walk past a Good Person, you add them to your count. If there are 1,000 Good People total, and you have passed 100 of them, your “Cumulative Good %” is 10%. This is exactly the False Positive Rate (FPR).
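
The walk down the queue translates almost line for line into code. The sketch below is a minimal illustration with a hypothetical function name and made-up scores. It assumes label 1 means bad and 0 means good, that both classes are present, and that scores are distinct (tied scores would need to be processed as a group).

```python
def ks_statistic(scores, labels):
    """Max gap between cumulative bad % (TPR) and cumulative good % (FPR)."""
    # Line everyone up from most suspicious to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_bad = sum(labels)
    total_good = len(labels) - total_bad

    seen_bad = seen_good = 0
    best_gap = 0.0
    for _score, label in ranked:
        if label == 1:
            seen_bad += 1   # walked past a bad person
        else:
            seen_good += 1  # walked past a good person
        tpr = seen_bad / total_bad
        fpr = seen_good / total_good
        best_gap = max(best_gap, tpr - fpr)
    return best_gap


# Made-up scores: three bad users (1) and three good users (0).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(round(ks_statistic(scores, labels), 4))
```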

Below is a simulation dashboard I asked Gemini to build to visualize these concepts: Link

Why Anti-Fraud Cares: Precision and Recall

In the anti-fraud world, Recall is just another word for TPR (Bad Room coverage). But Precision is different.

While TPR and FPR require you to look at the rooms separately, Precision requires you to look at who the model flagged from both rooms combined. It asks: “Of all the people the model claimed were bad, how many were actually bad?”

Why is this preferred over FPR in fraud operations?

Operational Reality: In fraud, a positive flag usually triggers a manual review, an SMS alert, or a transaction block. These actions have direct costs (agent time) and customer friction (insult). Precision measures the “purity” of the alert queue.
The Class Imbalance Problem: This is the key differentiator. Precision is highly sensitive to the Bad Rate.
- FPR is calculated only inside the Good Room. If you double the number of good customers, the FPR remains stable.
- Precision depends on the ratio of good to bad. If the number of good customers explodes while the number of fraudsters stays the same, your Precision will plummet because the “noise” (False Positives) drowns out the signal.
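
A short numeric sketch makes the class imbalance point concrete. The population sizes and rates below are hypothetical: the model's TPR and FPR are held fixed while the number of good customers grows tenfold.

```python
def precision_and_fpr(n_bad, n_good, tpr, fpr):
    """Expected precision and FPR for a model with fixed per-room rates."""
    tp = tpr * n_bad    # bad users flagged in the Bad Room
    fp = fpr * n_good   # good users flagged in the Good Room
    precision = tp / (tp + fp)
    measured_fpr = fp / n_good
    return precision, measured_fpr


# Same model (TPR 80%, FPR 1%), same 100 fraudsters, growing good population.
small = precision_and_fpr(n_bad=100, n_good=10_000, tpr=0.80, fpr=0.01)
large = precision_and_fpr(n_bad=100, n_good=100_000, tpr=0.80, fpr=0.01)
print([round(v, 4) for v in small])
print([round(v, 4) for v in large])
```

FPR is unchanged in both runs, while precision falls sharply as the good population grows, even though the model itself has not changed.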

In summary, the Modeling Team often focuses on AUC/KS because they measure the model’s pure ability to rank order, independent of the portfolio’s bad rate. Anti-Fraud focuses on Precision because it reflects the actual operational pain of sifting through false alarms in a sea of good transactions.

Why does the credit risk team seldom look into Precision and Recall?

This question deserves a dedicated deep dive in the next post.

Fundamentally, both Underwriting and Anti-Fraud teams share the exact same goal: maximizing profit. They simply approach the P&L equation from opposite ends (Alert: Fraud is a dynamic process. The environment is not stable because the fraudster evolves.):

Underwriting aims to maximize disbursements. They expand approvals until the marginal cost of defaults outweighs the marginal revenue from interest. Their constraint is the Breakeven Cost of Risk.
Anti-Fraud aims to minimize losses. They expand fraud detection until the cost of friction (insulting good customers) outweighs the savings from stopping fraud. Their constraint is Breakeven Precision.

In my following note, I plan to demonstrate mathematically that these two concepts are identical: the Breakeven Cost of Risk in lending equals the Breakeven Precision in fraud prevention.

It’s Not the Chaos, It’s the Expectation: A Framework for Deconstructing Anxiety

KY John — Sat, 13 Dec 2025 01:00:45 GMT

Recently, I’ve spent time reflecting on the problem of anxiety. It’s something many of us deal with, often feeling like a vague, overwhelming fog.

But when I sat down to analyze my own experiences—trying to pinpoint exactly where that feeling comes from—I realized that anxiety isn’t usually random chaos. It’s almost always structural. It stems from specific flaws in the mental models I use to navigate the world.

If we can identify the roots of the anxiety, we can build a framework to handle it.

Through my reflection, I identified three main sources of anxiety, how they interconnect, and a two-step approach to regain our footing.

Part 1: The Diagnosis (Where Anxiety Comes From)

In my experience, anxiety isn’t usually caused by the event itself, but rather my relationship to the event. It almost always stems from one of these three situations:

1. The Expectation Gap (The Illusion of Control)

We often hear that anxiety comes from a “loss of control.” But that’s only half the story. Before you can lose control, there must be an expectation that you had it in the first place.

Anxiety thrives in the gap between our expectations and reality. Sometimes, we set the bar impossibly high, wanting control over things that are inherently uncontrollable. When we expect to be in the driver’s seat and reality suddenly grabs the wheel, panic sets in. The anxiety isn’t just about what is happening; it is the friction caused by our resistance to reality.

2. The Value Vacuum (Not Knowing What You Want)

If you don’t have a clear internal hierarchy of values—knowing exactly what you want and what is truly important to you—you become a reactive vessel for external pressures.

We live in an era of massive information flow and endless options. Without a strong internal compass, everything feels equally important. The urgent drowns out the important. This leads to chronic overwhelm as we try to juggle conflicting priorities that we didn’t even choose for ourselves.

3. The Social Mirror (External Validation)

This is perhaps the most paralyzing source: putting too much emphasis on other people’s views of us.

When we don’t know our own value (point #2), we outsource the measuring of it to society. We look into the “social mirror” to see if we are okay. The problem is that the mirror is constantly changing, and we have zero control over what others think. Basing your stability on something unstable is a recipe for constant anxiety.

Part 2: The Prescription (How to Handle It)

So, how do we break this cycle? The solution requires flipping our operating system. Instead of looking outward for cues, we must start inward.

Step 1: Define Your “Self” First (First Principles Thinking)

To handle anxiety, you must make yourself the priority. Before you look at the world, you have to look in the mirror and ask: What do I actually want? What is genuinely important to me?

This is difficult. We are conditioned by social norms and existing incentive systems. We often treat these systems (like career ladders or social expectations) as rigid boundaries.

But we must remember that these systems were designed by other people to put constraints on behavior. Don’t treat them as immutable laws of physics before you even know what you want.

The shift: Figure out what you value first. After you know what you want, then you can look at the societal constraints and decide consciously whether you want to work within them or break them. You become the actor, not the reactor.

Step 2: The Spheres of Control

Once you know what you want, the final step is calibrating your effort. While the Stoics famously spoke of the “dichotomy of control” (what is yours vs. what isn’t), I find it more useful to split control into three distinct rings.

Ring 1: Direct Control

This is your internal territory. It includes your actions, your effort, what you say, and how you allocate your time. This is the only place where you have total agency.

Ring 2: Influence

This is the gray area where most anxiety lives. It includes relationships, team decisions, negotiations, and probabilities. You can affect the outcome here, but you cannot dictate it.

Ring 3: No Control

This is the environment. It includes the macro economy, the weather, the past, and other people’s internal states.

How this fixes anxiety:

Anxiety is usually a result of trying to apply Ring 1 energy to a Ring 3 problem. To reduce it, you must map your worries to the correct ring:

In Ring 1: Act strongly. Focus on high-quality input and effort.
In Ring 2: Experiment and negotiate. Influence the odds, but detach from the guarantee.
In Ring 3: Practice acceptance. Observe it without trying to change it.

Moving Forward

Anxiety often feels like a defect, but I’m starting to view it as a signal.

It’s a signal that my expectations are out of alignment with reality, or that I’m trying to control the uncontrollable. By using this framework to diagnose the source and placing my efforts in the correct ring, the fog begins to lift.