<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Tal Perry</title><description>Shouting into the void, now with AI.</description><link>https://talperry.com/</link><language>en</language><copyright>Copyright 2026, Calvin Tran</copyright><lastBuildDate>Fri, 15 Mar 2024 09:41:38 +0100</lastBuildDate><generator>Hugo - gohugo.io</generator><docs>http://cyber.harvard.edu/rss/rss.html</docs><atom:link href="https://talperry.com//atom.xml" rel="self" type="application/atom+xml"/><item><title>Engineering Agents is really UX Engineering</title><link>https://talperry.com/en/posts/genai/engineering-agents-building-trust/</link><description>&lt;p>Over the weekend I built an AI agent. I thought this was an engineering problem, but I now think it is a user experience problem.&lt;/p>
&lt;p>In regular software, we solve engineering problems with UX. If a page loads slowly, we show a spinner so the user feels like something is happening.&lt;/p>
&lt;p>In agentic software, we create UX problems with engineering decisions.&lt;/p>
&lt;p>If we take some data out of the chat history, but hint at that data in the UI, the user will think the agent is dumb and will stop trusting it. If we do not manage the agent’s “memory” carefully, the user will feel like they are talking to a different person every time. User trust collapses when the UI, and the data informing it, diverge from the agent’s memory.&lt;/p>
&lt;p>But I am getting ahead of myself.&lt;/p>
&lt;p>We rent an apartment in Berlin, and as the kids grow we need to rearrange their rooms. I hate this task because I am so bad at it.&lt;/p>
&lt;p>I cannot measure. I cannot design. Since it is a rental, it is not worth spending big bucks. And the worst part is searching for furniture that fits exactly, one slow click at a time.&lt;/p>
&lt;p>So I built an agent that measures, makes floor plans, figures out what is wrong with our layout, checks what fits, and orders from the catalog.&lt;/p>
&lt;p>In order to use it, I just need to trust it.&lt;/p>
&lt;h2 id="whats-inside">What&amp;rsquo;s inside&lt;/h2>
&lt;p>Two themes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Choosing tools for building Agents&lt;/strong>:
Looking back on my research and choices, should you use an agent framework (Yes!)? Which one (They&amp;rsquo;re not really differentiated)? What matters (a chat UI from day 1 + recording of every interaction)?&lt;/li>
&lt;li>&lt;strong>Engineering Agents is user experience design&lt;/strong>:
Users need to trust their agents. The wrong engineering decisions will make the agent feel dumb, erode user trust, and kill your KPIs, company, and reputation.&lt;/li>
&lt;/ul>
&lt;h2 id="context-state-and-memory">Context, State, and Memory&lt;/h2>
&lt;p>These are confusing words that are confusingly used together, so let me define them the way I am thinking about them.&lt;/p>
&lt;ul>
&lt;li>Context: the text the agent is currently working with&lt;/li>
&lt;li>Memory: more text the agent can retrieve or “remember” if it needs to&lt;/li>
&lt;li>State: the rest of the data in the application that influences the user’s perception of what is going on. The current floor plan, the progress bar, whether the agent is waiting for input, and so on.&lt;/li>
&lt;/ul>
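&lt;p>A minimal sketch of that three-way split, in Python. The names here are mine, not from any framework:&lt;/p>

```python
from dataclasses import dataclass, field

# Illustrative sketch only: the three buckets the app has to keep straight.

@dataclass
class AgentContext:
    """The text the agent is currently working with."""
    messages: list = field(default_factory=list)

@dataclass
class AgentMemory:
    """More text the agent can retrieve, or 'remember', if it needs to."""
    notes: dict = field(default_factory=dict)

    def recall(self, key):
        return self.notes.get(key)

@dataclass
class AppState:
    """Everything else the user perceives: floor plan, progress, waiting flag."""
    floor_plan: dict = field(default_factory=dict)
    waiting_for_input: bool = False
```

&lt;p>The point of the split is that each bucket has a different lifetime: context lives for a turn or a conversation, memory across conversations, and state belongs to the application itself.&lt;/p>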
&lt;p>When I use a coding assistant, memory feels a lot like context, because a coding agent running on your machine can just use the file system as memory. You and Claude operate on a shared data plane: the file system. If the agent writes a class or creates a folder, that change creates a persistent artifact that both of you can see.&lt;/p>
&lt;p>In my furniture planner, there is no such shared file system.&lt;/p>
&lt;p>The agent lives inside a webapp that does other things. The user and the agent are collaborating through chat to create assets—like a floor plan—that need to persist outside the chat stream. So I had to build a persistence layer that keeps both the application state and the conversational memory alive, and keeps them in sync.&lt;/p>
&lt;p>That sounds technical, but it is really a product problem.&lt;/p>
&lt;p>First, there is the question of continuity. If I start a new conversation, should the agent remember the floor plan? Probably. But what about the commentary around it?&lt;/p>
&lt;p>Suppose the user previously said: “My wall is 350x400 and the unusual height makes me feel cold and uncomfortable.”&lt;/p>
&lt;p>Persistence will happily store the wall dimensions, so we have them in the app. But the user’s remark about the unusual height, and how it feels, is easy to lose, as those typically only get stored in the message history.
Next time we load the project, the user will think it is obvious that this wall is part of the problem. The agent will likely just accept the dimensions as facts and fail to understand the emotional context that should steer the conversation. The user will feel like the agent is not keeping track of what it&amp;rsquo;s supposed to remember, and that it is just a dumb tool. That is a trust problem.&lt;/p>
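&lt;p>One way to avoid losing the remark, sketched below with hypothetical names: persist the qualitative comment right next to the structured fact, and render both back into the agent&amp;rsquo;s context when the project loads.&lt;/p>

```python
# Hypothetical persistence sketch: store the qualitative remark next to
# the structured fact, so reloading the project restores both together.

def save_wall(store, wall_id, width_cm, height_cm, user_remark=None):
    store[wall_id] = {
        "width_cm": width_cm,
        "height_cm": height_cm,
        # The easy-to-lose part: how the user feels about these numbers.
        "remarks": [user_remark] if user_remark else [],
    }

def wall_context_snippet(store, wall_id):
    """Render the fact plus its remarks back into the agent's context."""
    wall = store[wall_id]
    lines = [f"Wall {wall_id}: {wall['width_cm']}x{wall['height_cm']} cm"]
    for remark in wall["remarks"]:
        lines.append(f"User said: {remark}")
    return "\n".join(lines)
```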
&lt;p>Second, there is the synchronization problem. What does the UI think is happening, versus what the agent thinks is happening?&lt;/p>
&lt;p>If the UI renders a floor plan, we naturally assume the agent is aware of it. But if the user tweaks the plan in the UI, does the agent know? Is the agent’s context synchronized with the database state? Is what the user sees the same thing the agent is reasoning from?&lt;/p>
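&lt;p>A defensive pattern I find useful to think with here (a sketch, not a framework API): treat the database as the source of truth, version the plan, and re-inject it into the agent&amp;rsquo;s context whenever the version the agent last saw is stale.&lt;/p>

```python
# Sketch with my own naming, not a framework API: the database is the
# source of truth, and the plan is re-injected when the agent is stale.

def build_turn_context(db_plan, chat_history, version_seen_by_agent):
    context = list(chat_history)
    if db_plan["version"] != version_seen_by_agent:
        # The user edited the plan in the UI since the agent last saw it.
        context.append(
            "System note: the floor plan changed in the UI. "
            f"Current plan (v{db_plan['version']}): {db_plan['layout']}"
        )
    return context
```

&lt;p>The check runs at the start of every turn, so a UI edit can never silently diverge from what the agent is reasoning about.&lt;/p>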
&lt;h2 id="choosing-tools-for-building-agents">Choosing Tools for Building Agents&lt;/h2>
&lt;p>I started this project thinking the big technical question would be: what framework should I use?&lt;/p>
&lt;p>I now think that question matters less than I expected. What mattered the most is the ease of integration with UI. Whatever form your agent takes, it&amp;rsquo;s very helpful to interact with it in that form and experience it as the user would. My agent had a lot of media to navigate and display, so a web page with a chat made sense. Having that out of the box from my framework (Pydantic) was a huge help.&lt;/p>
&lt;p>As an example, what is the user experience while my agent searches the Ikea catalog? Do we just display a spinner indefinitely? Of course not! There is a standard UI pattern for &amp;ldquo;tool calls&amp;rdquo;, and I am thrilled not to have had to rediscover or reimplement it myself. I used CopilotKit&amp;rsquo;s agent UI framework and it made life easy.&lt;/p>
&lt;h2 id="multiple-agents--multiple-personalities--hard-to-trust">Multiple Agents == Multiple Personalities == Hard to Trust&lt;/h2>
&lt;p>Once I had the UI and the state problems in view, another issue became obvious: multiple agents.&lt;/p>
&lt;p>In my workflow, I want different kinds of agent behavior at different times:&lt;/p>
&lt;ul>
&lt;li>one that specializes in measurement&lt;/li>
&lt;li>one that specializes in understanding needs&lt;/li>
&lt;li>one that matches those needs against the design catalog&lt;/li>
&lt;li>one that does constrained optimization to make sure the solution is actually feasible&lt;/li>
&lt;/ul>
&lt;p>From an engineering perspective, it is very tempting to make these isolated components: multiple agents that don’t share state. One feeds the next. It’s easier to reason about and easier to debug.&lt;/p>
&lt;p>The alternative is a single agent that knows everything, remembers everything, and does everything. That’s harder to engineer, but it has one huge advantage: it feels like one coherent actor from the user’s perspective.&lt;/p>
&lt;p>Generating the floor plan is not a one-shot transformation. It is a conversation. The agent asks for information, draws something, the user says “no, the door is on the other side,” or “you forgot the window,” or “there is a couch here.” The agent refines, asks follow-up questions, interprets corrections, and keeps going until the user is satisfied.&lt;/p>
&lt;p>For that sub-workflow, I don’t really want the whole rest of the system involved. I don’t want every tool, every piece of state, every message, all crammed into that one loop. I just want it to stay in that conversation until the floor plan is good enough.&lt;/p>
&lt;p>But here’s the trap: when that separation shows up to the user as “forgetting,” it destroys trust.&lt;/p>
&lt;p>If the floor-planning agent doesn’t remember something the user already said earlier in the broader chat, it looks dumb. The user does not think “ah yes, I see I have crossed a subsystem boundary.” They think the agent forgot.&lt;/p>
&lt;p>This is why I say multiple agents can feel like multiple personalities. Even if the decomposition is elegant internally, it can feel like talking to someone with selective amnesia.&lt;/p>
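&lt;p>One way to blunt the amnesia, sketched with illustrative names: keep the specialist agents, but route them all through a single shared memory, so the decomposition stays an implementation detail rather than something the user experiences.&lt;/p>

```python
# Sketch: specialist agents that read and write one shared memory, so the
# decomposition stays invisible to the user. All names are illustrative.

class SharedMemory:
    def __init__(self):
        self.facts = {}

    def remember(self, key, value):
        self.facts[key] = value

def measurement_agent(memory, user_input):
    # Pretend we parsed wall dimensions out of the user's message.
    memory.remember("wall", user_input)
    return "Got the measurements."

def catalog_agent(memory):
    # A later specialist can still see what an earlier one learned.
    wall = memory.facts.get("wall", "unknown")
    return f"Searching for furniture that fits a {wall} wall."
```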
&lt;p>Right now, this is a me-facing tool, so I’m not overly worried. But as soon as the user is meant to trust the system directly, this becomes a major issue.&lt;/p>
&lt;h2 id="prompt-management-and-evals">Prompt Management and Evals&lt;/h2>
&lt;p>Another place where the “one coherent actor” illusion breaks is prompting.&lt;/p>
&lt;p>Once the high-level agentic workflow is in place, the prompts are what drive the nuance of behavior.&lt;/p>
&lt;p>For example, I added a semantic search layer over the product catalog.&lt;/p>
&lt;p>I prompted the agent to run 5 or 6 variations of a concept. So instead of searching only for “plants,” it might search for “low-light plants,” “shadow-loving greenery,” “bathroom plants,” and so on.&lt;/p>
&lt;p>That diversity mattered a lot. It forced the agent to explore the catalog instead of lazily grabbing the first plausible results.&lt;/p>
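&lt;p>The diversity came from prompting, but the surrounding mechanism can be sketched like this; &lt;code>generate_variations&lt;/code> and &lt;code>search_catalog&lt;/code> are stand-ins for the LLM call and the vector index:&lt;/p>

```python
# Sketch of the broaden-then-merge search pattern. generate_variations
# and search_catalog stand in for the LLM call and the vector index.

def broadened_search(concept, generate_variations, search_catalog, n=5):
    queries = [concept] + generate_variations(concept, n - 1)
    seen = {}
    for query in queries:
        for item in search_catalog(query):
            # Deduplicate by product id, keeping the first hit.
            seen.setdefault(item["id"], item)
    return list(seen.values())
```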
&lt;p>Then one day, that part of the prompt disappeared.&lt;/p>
&lt;p>The code was fine. The system still worked. But the quality dropped hard, because the agent stopped searching broadly and started doing the lazy obvious thing.&lt;/p>
&lt;p>That kind of regression is hard to catch. “Generate several related but distinct search queries” is not a crisp unit-testable behavior. It is a qualitative behavior. The code can be perfectly correct while the intelligence degrades.&lt;/p>
&lt;p>That is why prompt management matters. I want to know the state of the prompt. I want to version it. I want to understand how one piece of prompting affects one behavior. Otherwise the whole system turns into a clot of instructions that is impossible to reason about.&lt;/p>
&lt;p>Modern coding assistants are a good analogy. They have a system prompt, then extra context files, then tool definitions that add more instructions. Prompting is no longer one blob. It is compositional. That same complexity exists in the agents we build ourselves.&lt;/p>
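&lt;p>A minimal way to get that visibility, assuming nothing more than a dict of named, versioned fragments: assemble the final prompt from explicit pieces, so a fragment that disappears fails loudly instead of silently degrading the agent.&lt;/p>

```python
# Sketch: versioned prompt fragments assembled into one system prompt.
# A missing fragment raises instead of silently degrading behavior.

FRAGMENTS = {
    ("search_diversity", 2): "Generate 5 or 6 related but distinct search queries.",
    ("persona", 1): "You are a careful furniture-planning assistant.",
}

def assemble_prompt(required):
    parts = []
    for name, version in required:
        if (name, version) not in FRAGMENTS:
            raise KeyError(f"prompt fragment missing: {name} v{version}")
        parts.append(FRAGMENTS[(name, version)])
    return "\n\n".join(parts)
```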
&lt;h2 id="defining-behavior-user-stories-and-capabilities-over-specs">Defining Behavior: User stories and Capabilities over Specs&lt;/h2>
&lt;p>In normal software work, I have had the best luck with agents when I give them detailed technical specs. The spec says how to test, how to implement, what constraints matter, what patterns to follow.
But when engineering agents, where engineering errors erode &lt;em>user trust&lt;/em> in the agent, user stories and agent capabilities replace specs as the cornerstone of planning and design.&lt;/p>
&lt;p>Agent capabilities might sound like &lt;code>Skills.md&lt;/code>, but it&amp;rsquo;s not the same. A skill spec says &amp;ldquo;here is how to do x and what you need to do it&amp;rdquo;. Describing a capability is more like saying &amp;ldquo;I want you to solve this problem&amp;rdquo; and expecting the agent to fill in the blanks.&lt;/p>
&lt;p>For example, I might define a capability like Estimate Room Measurements from Photos. I want the agent to ask for photos, combine them, and infer dimensions of the room and the objects in it. I do not necessarily want to start by prescribing: use semantic segmentation, depth estimation, plane detection, vanishing points, and so on.&lt;/p>
&lt;p>Partly that is because I was working in areas where I did not know enough to write the right low-level spec. Computer vision, frontend, etc.&lt;/p>
&lt;p>But partly it is because the agent is a user-facing product. The important thing is not the internal recipe. The important thing is the behavior: what it helps the user accomplish, how it asks questions, what it does when it is uncertain, and what result it is trying to produce. What do we need to validate between releases? Not the &amp;ldquo;how&amp;rdquo; of how it reaches an answer but the &amp;ldquo;what&amp;rdquo;: what was the answer, and was it right?&lt;/p>
&lt;h2 id="the-happy-path-strategy">The Happy Path Strategy&lt;/h2>
&lt;p>One practical way I managed the chaos of user stories for agents was by defining a very rigorous happy path.&lt;/p>
&lt;p>This is a closed workflow. The user is trying to reorganize a room. That means most possible things they could say are actually out of scope.&lt;/p>
&lt;p>The happy path looks roughly like this:&lt;/p>
&lt;ol>
&lt;li>Discovery: what is the user trying to do?&lt;/li>
&lt;li>Solicitation: collect constraints, preferences, and photos.&lt;/li>
&lt;li>Construction: build the floor plan.&lt;/li>
&lt;li>Refinement: solve problems and swap furniture.&lt;/li>
&lt;/ol>
&lt;p>Everything else is basically exception handling.&lt;/p>
&lt;p>If the user asks about God, uploads a picture of a cat, or says “stop” halfway through a measurement flow, those are deviations from the intended flow and need explicit handling.&lt;/p>
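&lt;p>The happy path plus exception handling can be sketched as a tiny state machine; &lt;code>classify_intent&lt;/code> stands in for the model&amp;rsquo;s judgment about whether a message belongs to the current stage:&lt;/p>

```python
# Sketch: the happy path as an explicit stage list, with everything
# off-path routed to a handler. classify_intent stands in for the model.

STAGES = ["discovery", "solicitation", "construction", "refinement"]

def next_action(stage, classify_intent, message):
    intent = classify_intent(message)
    if intent == "on_path":
        index = STAGES.index(stage) + 1
        if index >= len(STAGES):
            return ("done", None)
        return ("advance", STAGES[index])
    if intent == "stop":
        # "stop" means something different mid-measurement vs mid-shopping,
        # so the handler receives the stage it happened in.
        return ("handle_stop", stage)
    return ("polite_rejection", stage)  # cat photos, theology, etc.
```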
&lt;p>This helped me a lot because you cannot really unit test the magic. But you can define the conversation that should contain it.&lt;/p>
&lt;p>You can define what “stop” means during measurement versus during shopping. You can say that if a photo is clearly not a room, the agent should reject it politely. In older software, I would have needed to build a dedicated detector for that. Here I can rely on the model to identify the image, and focus my specification on the behavior.&lt;/p>
&lt;p>A traditional spec might define the ImageError class. A conversational spec defines how the agent politely explains that a cat is not a bedroom.&lt;/p>
&lt;p>For these systems, that conversational rigor is the safety net.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Building this agent taught me a few things.&lt;/p>
&lt;p>Framework choice matters, but less than I expected. Graphs are useful once the workflow is real, but they are not what makes the product feel intelligent.&lt;/p>
&lt;p>A basic chat UI matters much more than I expected, because it lets you debug the system in the same mode that the user will experience it.&lt;/p>
&lt;p>Prompt management matters because behavior regresses in subtle ways, and code correctness does not protect you from that.&lt;/p>
&lt;p>Multiple agents are attractive architecturally, but from the user’s point of view they risk becoming multiple personalities.&lt;/p>
&lt;p>And the hardest problem, really, is state.&lt;/p>
&lt;p>A web agent is not living in the file system with you. It is living in an application, with a database, a UI, and a conversation. The product only works if those things stay synchronized closely enough that the user feels the agent is living in the same reality they are.&lt;/p>
&lt;p>That is what trust is, in this kind of software.&lt;/p></description><author/><guid>https://talperry.com/en/posts/genai/engineering-agents-building-trust/</guid><pubDate>Mon, 09 Mar 2026 02:11:00 +0100</pubDate></item><item><title>The AI-Powered 10-Minute Habit That Taught My Kid to Read (And Made Me a Better Dad)</title><link>https://talperry.com/en/posts/genai/learning-to-read-with-ai/</link><description>&lt;p>&lt;span class="dropcap-wrap" data-german="Igel" data-english="Hedgehog" data-large="/letters/I_Igel_Hedgehog_hu1482128859696014127.webp">
&lt;img src="https://talperry.com/letters/I_Igel_Hedgehog_hu17110550747935687938.webp" alt="Letter I" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>built a system to teach my kid to read, using a free program called Anki for the &amp;ldquo;planning&amp;rdquo; and AI to make content that would lure him in.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Iglu" data-english="Igloo" data-large="/letters/I_Iglu_Igloo_hu2009170655832268501.webp">
&lt;img src="https://talperry.com/letters/I_Iglu_Igloo_hu12057804526470791817.webp" alt="Letter I" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>t worked: he and I enjoyed it immensely, and he now reads fluently (though mechanically). Along the way I observed and learned a great deal about learning and teaching, and about young kids, or at least my young kid. That&amp;rsquo;s what I want to share with you today. In particular, I&amp;rsquo;d like to:&lt;/p>
&lt;ol>
&lt;li>Give you a little background on the tech and science I used to do this (but just the basics)&lt;/li>
&lt;li>Share what I actually did, and my intuitions for why I did them&lt;/li>
&lt;li>Share what I learned in the process.&lt;/li>
&lt;li>Point at how we as parents and teachers can use AI to teach kids.&lt;/li>
&lt;/ol>
&lt;h2 id="anki-and-spaced-repetition">Anki and spaced repetition&lt;/h2>
&lt;p>&lt;span class="dropcap-wrap" data-german="Affe" data-english="Monkey" data-large="/letters/A_Affe_Monkey_hu11989866935096095823.webp">
&lt;img src="https://talperry.com/letters/A_Affe_Monkey_hu17897185792508655664.webp" alt="Letter A" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>nki is a free program that implements &amp;ldquo;spaced repetition&amp;rdquo;, a technique for memorizing things. Both rest on the psychological &amp;ldquo;testing effect&amp;rdquo;, the finding from learning psychology that it&amp;rsquo;s easier to memorize something by trying to recall it than by rereading it.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Ameise" data-english="Ant" data-large="/letters/A_Ameise_Ant_hu1719830454569949506.webp">
&lt;img src="https://talperry.com/letters/A_Ameise_Ant_hu10470039641888831123.webp" alt="Letter A" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>nki (and similar programs) let you input what you want to learn, and have study sessions. Just like using flashcards to study. Anki&amp;rsquo;s killer feature is that you can score, with numbers, how well you remembered something (4=perfect, 1=not at all), and Anki will remember the statistics.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Ananas" data-english="Pineapple" data-large="/letters/A_Ananas_Pineapple_hu708467253598580632.webp">
&lt;img src="https://talperry.com/letters/A_Ananas_Pineapple_hu8830242695221920886.webp" alt="Letter A" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>nki then uses fancy math (an algorithm) to calculate what you should study tomorrow, to maximize the amount of learning you get for the time you invest. Or to minimize the amount of time you need to spend to learn something.&lt;/p>
&lt;h3 id="getting-a-kids-attention">Getting a kid&amp;rsquo;s attention&lt;/h3>
&lt;p>&lt;span class="dropcap-wrap" data-german="Maler" data-english="Painter" data-large="/letters/M_Maler_Painter_hu10325223952203934668.webp">
&lt;img src="https://talperry.com/letters/M_Maler_Painter_hu3950468113467810691.webp" alt="Letter M" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>y son was about four years old when we started. He did not care about spaced repetition, or compounding effects of learning, or daddy&amp;rsquo;s fancy algorithms. He did like colourful pictures and cuddles.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Wald" data-english="Forest" data-large="/letters/W_Wald_Forest_hu16429378600325742329.webp">
&lt;img src="https://talperry.com/letters/W_Wald_Forest_hu2997903226893313084.webp" alt="Letter W" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>e had some wall charts with colorful letters and things that started with them, but they were kind of tame and they certainly didn&amp;rsquo;t get much of his mental real estate. I had a sense that if I could make things that were weird and delightful and surprising, he&amp;rsquo;d let them in or at least pay attention.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Igel" data-english="Hedgehog" data-large="/letters/I_Igel_Hedgehog_hu1482128859696014127.webp">
&lt;img src="https://talperry.com/letters/I_Igel_Hedgehog_hu17110550747935687938.webp" alt="Letter I" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>&amp;rsquo;ve always been good with weird and surprising, but on a good day I am aesthetically anemic: I could never visualize my ideas in a palatable way, much less produce hundreds of variants that all looked great. That&amp;rsquo;s where AI stepped in: for each of the 26 letters I generated enough variants to throw out the useless ones and keep the best for my boy.&lt;/p>
&lt;div class="img-row">
&lt;picture>
&lt;source srcset="https://talperry.com/letters/I_Insekten_Insects_hu5307193560213127063.webp" type="image/webp">
&lt;source srcset="https://talperry.com/letters/I_Insekten_Insects_hu5307193560213127063.webp" type="image/webp">
&lt;img src="https://talperry.com/letters/I_Insekten_Insects_hu5307193560213127063.webp" alt="I is for Insects" loading="lazy">
&lt;/picture>
&lt;picture>
&lt;source srcset="https://talperry.com/letters/I_Iglu_Igloo_hu2653106235828540449.webp" type="image/webp">
&lt;source srcset="https://talperry.com/letters/I_Iglu_Igloo_hu2653106235828540449.webp" type="image/webp">
&lt;img src="https://talperry.com/letters/I_Iglu_Igloo_hu2653106235828540449.webp" alt="I is for Igloo" loading="lazy">
&lt;/picture>
&lt;picture>
&lt;source srcset="https://talperry.com/letters/I_Insel_Island_hu8970070859097694239.webp" type="image/webp">
&lt;source srcset="https://talperry.com/letters/I_Insel_Island_hu8970070859097694239.webp" type="image/webp">
&lt;img src="https://talperry.com/letters/I_Insel_Island_hu8970070859097694239.webp" alt="I is for Island" loading="lazy">
&lt;/picture>
&lt;/div>
&lt;h2 id="emotional-anesthesia-building-confidence-through-memorization">Emotional Anesthesia: Building Confidence Through Memorization&lt;/h2>
&lt;p>&lt;span class="dropcap-wrap" data-german="Insekten" data-english="Insects" data-large="/letters/I_Insekten_Insects_hu14949999140754116231.webp">
&lt;img src="https://talperry.com/letters/I_Insekten_Insects_hu1841915600184092740.webp" alt="Letter I" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span> think memorization is awesome—especially for kids. Memorization is like anesthesia for insecurity: you can’t think “I can’t” if you don’t even get the chance to think, because you already memorized whatever it is you were going to think about.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Iglu" data-english="Igloo" data-large="/letters/I_Iglu_Igloo_hu2009170655832268501.webp">
&lt;img src="https://talperry.com/letters/I_Iglu_Igloo_hu12057804526470791817.webp" alt="Letter I" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span> think kids (and adults) are naturally curious and want to learn and succeed. But then we beat into them that they&amp;rsquo;re lazy, or dumb. Someone laughs at them. Dad loses patience. The teacher scolds them for not paying attention. And now it&amp;rsquo;s: &amp;ldquo;I can&amp;rsquo;t.&amp;rdquo;&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Wasser" data-english="Water" data-large="/letters/W_Wasser_Water_hu716086202171572872.webp">
&lt;img src="https://talperry.com/letters/W_Wasser_Water_hu14611659763967967197.webp" alt="Letter W" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>ho wants to fight with their kid even more? That&amp;rsquo;s exhausting—and sad. So instead of fighting them and their self-limiting beliefs, let&amp;rsquo;s just skip over that whole part of the brain that&amp;rsquo;s making them feel &amp;ldquo;I can&amp;rsquo;t.&amp;rdquo;&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Hahn" data-english="Rooster" data-large="/letters/H_Hahn_Rooster_hu6015066235283230222.webp">
&lt;img src="https://talperry.com/letters/H_Hahn_Rooster_hu7700958604819589043.webp" alt="Letter H" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>ow? With memorization. Memorization is targeted emotional anesthesia. We memorize things—small things—and then, when we need them, we don&amp;rsquo;t even notice they&amp;rsquo;re there because they&amp;rsquo;re already memorized. The brain pulls up what it knows before there&amp;rsquo;s time to feel anything—certainly before the body tenses and the &amp;ldquo;I can&amp;rsquo;t&amp;rdquo; shows up. (This apparently follows from Sweller&amp;rsquo;s Cognitive Load Theory, so I&amp;rsquo;m not just ranting.)&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Farbe" data-english="Color" data-large="/letters/F_Farbe_Color_hu44470047475286189.webp">
&lt;img src="https://talperry.com/letters/F_Farbe_Color_hu15687849831792248789.webp" alt="Letter F" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>irst letter forms, then diphthong sounds, then whole words. Sentences stagger, then flow, and one day they&amp;rsquo;re reading The Anarchist Cookbook and learning to make high explosives in Mom&amp;rsquo;s favorite pot. Sorry, Mom.&lt;/p>
&lt;div class="img-row">
&lt;picture>
&lt;source srcset="https://talperry.com/en/posts/genai/learning-to-read-with-ai/learning_progress_hu12522524810842550186.webp" type="image/webp">
&lt;source srcset="https://talperry.com/en/posts/genai/learning-to-read-with-ai/learning_progress_hu12522524810842550186.webp" type="image/webp">
&lt;img src="https://talperry.com/en/posts/genai/learning-to-read-with-ai/learning_progress_hu12522524810842550186.webp" alt="Example Anki card screenshot" loading="lazy">
&lt;/picture>
&lt;/div>
&lt;hr>
&lt;h2 id="the-emotional-turn-the-real-discovery">The Emotional Turn (The Real Discovery)&lt;/h2>
&lt;p>&lt;span class="dropcap-wrap" data-german="Salz" data-english="Salt" data-large="/letters/S_Salz_Salt_hu10216670104941513018.webp">
&lt;img src="https://talperry.com/letters/S_Salz_Salt_hu15086118224761705563.webp" alt="Letter S" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>o far I&amp;rsquo;ve told you theory, my thoughts on memorization, what spaced repetition is, that Anki is a program that does spaced repetition, and that I made engaging pictures with AI so that we could learn letters. But what did this actually look like?&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Insekten" data-english="Insects" data-large="/letters/I_Insekten_Insects_hu14949999140754116231.webp">
&lt;img src="https://talperry.com/letters/I_Insekten_Insects_hu1841915600184092740.webp" alt="Letter I" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span> put those pictures I made in Anki, each day Anki would pick a few new ones, and a few that we needed to review according to its algorithm. Then we&amp;rsquo;d look at a picture, I&amp;rsquo;d ask my son &amp;ldquo;what is that&amp;rdquo; and he&amp;rsquo;d say &amp;ldquo;Igloo&amp;rdquo; or &amp;ldquo;Dog&amp;rdquo; or &amp;ldquo;Worm&amp;rdquo;, then I&amp;rsquo;d ask him what letter it was and he&amp;rsquo;d say &amp;ldquo;I&amp;rdquo; or &amp;ldquo;D&amp;rdquo; or &amp;ldquo;I don&amp;rsquo;t know&amp;rdquo; or squirm around and shriek.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Insel" data-english="Island" data-large="/letters/I_Insel_Island_hu13363896415183423027.webp">
&lt;img src="https://talperry.com/letters/I_Insel_Island_hu11839928786208151714.webp" alt="Letter I" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>f he didn&amp;rsquo;t know at all I&amp;rsquo;d press 1; if he knew it perfectly I&amp;rsquo;d press 4, with 2 or 3 for the stuff in between. Anki would put up the next item until we got through them all. Sessions lasted 5 to 10 minutes tops, and when they ran over I&amp;rsquo;d stop them and configure Anki to do less work or more easy stuff.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Wal" data-english="Whale" data-large="/letters/W_Wal_Whale_hu7388934065381436548.webp">
&lt;img src="https://talperry.com/letters/W_Wal_Whale_hu17461669685176804783.webp" alt="Letter W" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>e did it every day, my son and I. &amp;ldquo;Bo na&amp;rsquo;aseh Buchstaben,&amp;rdquo; he&amp;rsquo;d say. Half Hebrew, half German. &amp;ldquo;Let&amp;rsquo;s do letters.&amp;rdquo; I&amp;rsquo;d sit on the orange beanbag in his room, with Anki on my laptop. Sometimes he&amp;rsquo;d sit on my lap right away, sometimes he&amp;rsquo;d hang naked off his bunkbed, his ballsack swinging in front of my face in the cool Berlin morning air.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Instrumente" data-english="Instruments" data-large="/letters/I_Instrumente_Instruments_hu9014410493997790894.webp">
&lt;img src="https://talperry.com/letters/I_Instrumente_Instruments_hu9751480735879318599.webp" alt="Letter I" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span> don&amp;rsquo;t know why, but for the ten minutes a day we&amp;rsquo;d learn together, the holy spirit would possess me and I was filled with infinite patience. No matter what my son did — squirm, jump, hang naked off his bed dangling his balls in my face — I just let him be, waited, hugged him and gently nodded him back when he was ready. It was our comfy, intimate, happy time together. It almost never happened that I got frustrated in that time.&lt;/p>
&lt;picture>
&lt;source srcset="https://talperry.com/en/posts/genai/learning-to-read-with-ai/session_hu5654516879113692651.webp" type="image/webp">
&lt;source srcset="https://talperry.com/en/posts/genai/learning-to-read-with-ai/session_hu5654516879113692651.webp" type="image/webp">
&lt;img src="https://talperry.com/en/posts/genai/learning-to-read-with-ai/session_hu5654516879113692651.webp" alt="Anki session snapshot" class="article-image" loading="lazy">
&lt;/picture>
&lt;h2 id="shifting-the-optimization-target-from-memory-to-affirmation">Shifting the Optimization Target: From Memory to Affirmation&lt;/h2>
&lt;p>&lt;span class="dropcap-wrap" data-german="Kamel" data-english="Camel" data-large="/letters/K_Kamel_Camel_hu3335982631660133316.webp">
&lt;img src="https://talperry.com/letters/K_Kamel_Camel_hu7235405561059269113.webp" alt="Letter K" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>ids aren&amp;rsquo;t workers; we&amp;rsquo;re not trying to optimize their learning productivity (the school&amp;rsquo;s productivity is another matter). One day I realized how uncharacteristically patient I was being, and how enabling and nice that was for my son.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Hand" data-english="Hand" data-large="/letters/H_Hand_Hand_hu9779494230232409972.webp">
&lt;img src="https://talperry.com/letters/H_Hand_Hand_hu5328980168084718685.webp" alt="Letter H" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>e may have outgrown the humor of my funny pictures, but he kept coming back for the warmth and intimacy, and the validation of getting it right. And I thought: I don&amp;rsquo;t need to optimize for memorization, which is what Anki tries to do. I can skew the system for fun instead. I had no deadline. No exam. We were doing this for our own pleasure.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Anzug" data-english="Suit" data-large="/letters/A_Anzug_Suit_hu7013813314008038013.webp">
&lt;img src="https://talperry.com/letters/A_Anzug_Suit_hu13706605758447164449.webp" alt="Letter A" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>nd so I did: I set up Anki to mostly show us material my son knew very well. I made trivially easy cards to study, &lt;code>1+1=?&lt;/code> and then &lt;code>Eins plus Eins ist?&lt;/code>, so that he might see he has 20 units of work, plough through 18 of them in 2 minutes, and feel like a genius. That only 2 of the 20 cards carried new information, or &amp;ldquo;learning value&amp;rdquo;, was fine. For our purposes we were optimising retention and joy, reinforcing the habit and the feeling that &amp;ldquo;I can do it&amp;rdquo;.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Oboe" data-english="Oboe" data-large="/letters/O_Oboe_Oboe_hu6518843103298836047.webp">
&lt;img src="https://talperry.com/letters/O_Oboe_Oboe_hu14449703266525210118.webp" alt="Letter O" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>ver time, not only did my son learn to read (and also got very good at arithmetic), he learned about learning. He&amp;rsquo;d struggle reading a sentence and get upset, and I&amp;rsquo;d remind him that a few months ago he couldn&amp;rsquo;t recognise most of the letters, and that he did the work, so that now he can read.
He accepted that, and lives with the knowledge that if he puts in the work he will get the reward, which I think is the bigger educational achievement even than learning to read.&lt;/p>
&lt;h2 id="recap-and-future">Recap and future&lt;/h2>
&lt;p>&lt;span class="dropcap-wrap" data-german="Insel" data-english="Island" data-large="/letters/I_Insel_Island_hu13363896415183423027.webp">
&lt;img src="https://talperry.com/letters/I_Insel_Island_hu11839928786208151714.webp" alt="Letter I" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span> feel immense satisfaction with this whole project. I love that my son can read, that &amp;ldquo;it worked&amp;rdquo;, the emotional bond we built. I value the learning skills he picked up, and the learning skills I learned from learning about learning.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Lampe" data-english="Lamp" data-large="/letters/L_Lampe_Lamp_hu9086089164003164307.webp">
&lt;img src="https://talperry.com/letters/L_Lampe_Lamp_hu9682591275743805447.webp" alt="Letter L" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>ooking back, seeing my son in school with his reading skills and confidence in his ability to learn, I think:&lt;/p>
&lt;ul>
&lt;li>Making the AI pictures was fun for me, and super engaging for my son — it did open the door&lt;/li>
&lt;li>Having a few very different versions of each letter (Apple, Ape, &amp;hellip;) helped him not &amp;ldquo;overfit&amp;rdquo; and actually learn the letters&lt;/li>
&lt;li>The emotional &amp;ldquo;peace&amp;rdquo; and my own patience were critical for making this successful. My son was drawn to and will remember the cuddles, not the act of learning.&lt;/li>
&lt;li>The framework of spaced repetition was a good starting plan for curriculum planning, and was enriched by:
&lt;ul>
&lt;li>Keeping things short, under 10 minutes per learning session&lt;/li>
&lt;li>Eventually optimising for &amp;ldquo;winning&amp;rdquo; and building confidence, even at the cost of less learning per session&lt;/li>
&lt;li>Adding trivial learning cards, just to reinforce that winning&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;span class="dropcap-wrap" data-german="Instrumente" data-english="Instruments" data-large="/letters/I_Instrumente_Instruments_hu9014410493997790894.webp">
&lt;img src="https://talperry.com/letters/I_Instrumente_Instruments_hu9751480735879318599.webp" alt="Letter I" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span> did this project in 2023-2024, when my son was in kindergarten. I was excited to bring it into the classroom, but could not figure out a way: the teachers wouldn&amp;rsquo;t be able to manage multi-user sessions of Anki, teach each child how to use it, and so on. And the kindergarten management was also not amenable to innovation.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Teddybär" data-english="Teddy Bear" data-large="/letters/T_Teddyb%C3%A4r_Teddy%20Bear_hu9508806944643113737.webp">
&lt;img src="https://talperry.com/letters/T_Teddyb%C3%A4r_Teddy%20Bear_hu12041565987269471924.webp" alt="Letter T" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>ech has evolved so fast in the last two years that I think getting this into classrooms is, or will soon be, feasible. The first blocker was usability: children need to be able to use the system independently, and it needs enough reliability and maturity that teachers are freed and empowered rather than pressed into a support role.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Telefon" data-english="Phone" data-large="/letters/T_Telefon_Phone_hu1983613764125235301.webp">
&lt;img src="https://talperry.com/letters/T_Telefon_Phone_hu18019848682305873414.webp" alt="Letter T" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>he other blocker is regulatory and cultural, the willingness of the regulator and parents to allow their children&amp;rsquo;s learning statistics and behavioural patterns to go to some company&amp;rsquo;s cloud. Perhaps in the U.S., but this seems unimaginable in Germany.&lt;/p>
&lt;p>&lt;span class="dropcap-wrap" data-german="Ball" data-english="Ball" data-large="/letters/B_Ball_Ball_hu9734392891948123528.webp">
&lt;img src="https://talperry.com/letters/B_Ball_Ball_hu1105597628676062714.webp" alt="Letter B" class="dropcap" width="80" height="80" loading="lazy">
&lt;/span>ut the models are getting smaller, and small enough that we can literally &amp;ldquo;put them in a box&amp;rdquo; that doesn&amp;rsquo;t connect to the internet. And yet, as they get smaller, they are smarter, and it&amp;rsquo;s easy to imagine a voice interface that kids can use, that can manage auth and be set onsite, and simple agentic workflows that unstick a child that gets off the path. Maybe not today, but I think, soon, and it&amp;rsquo;s exciting.&lt;/p></description><author/><guid>https://talperry.com/en/posts/genai/learning-to-read-with-ai/</guid><pubDate>Mon, 02 Feb 2026 09:00:00 +0100</pubDate></item><item><title>Five Practical Lessons for Serving Models with Triton Inference Server</title><link>https://talperry.com/en/posts/genai/triton-inference-server/</link><description>&lt;p>Triton Inference Server has become a popular choice for production model serving, and for good reason: it is fast, flexible, and powerful. That said, using Triton effectively requires understanding where it shines—and where it very much does not. This post collects five practical lessons from running Triton in production that I wish I had internalized earlier.&lt;/p>
&lt;h2 id="choose-the-right-serving-layer">Choose the Right Serving Layer&lt;/h2>
&lt;p>Not all models belong on Triton. &lt;strong>Use vLLM for generative models; use Triton for more traditional inference workloads.&lt;/strong>&lt;/p>
&lt;p>LLMs are everywhere right now, and Triton offers integrations with both TensorRT-LLM and vLLM. At first glance, this makes Triton look like a one-stop shop for serving everything from image classifiers to large language models.&lt;/p>
&lt;p>In practice, I’ve found that Triton adds very little on top of a “raw” vLLM deployment. That’s not a knock on Triton—it’s a reflection of how different generative workloads are from classical inference. Many of Triton’s best features simply don’t map cleanly to the way LLMs are served.&lt;/p>
&lt;p>A few concrete examples make this clear:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Dynamic batching → Continuous batching&lt;/strong>
Triton’s dynamic batcher waits briefly to group whole requests and then executes them together. This works extremely well for fixed-shape inference. LLM serving, on the other hand, benefits from continuous batching, where new requests are inserted into an active batch as others finish generating tokens. While this is technically possible through Triton’s vLLM backend, it is neither simple nor obvious to operate.&lt;/li>
&lt;/ul>
&lt;picture>
&lt;source srcset="https://talperry.com/en/posts/genai/triton-inference-server/dynamic-vs-continuous-batching_hu5809796310521240948.webp" type="image/webp">
&lt;source srcset="https://talperry.com/en/posts/genai/triton-inference-server/dynamic-vs-continuous-batching_hu6424849289932300575.png" type="image/png">
&lt;img src="https://talperry.com/en/posts/genai/triton-inference-server/dynamic-vs-continuous-batching_hu6424849289932300575.png" alt="Dynamic batching vs continuous batching" class="article-image" loading="lazy">
&lt;/picture>
&lt;ul>
&lt;li>&lt;strong>Model packing → Model sharding&lt;/strong>
Triton makes it easy to pack multiple models onto a single GPU to improve utilization. LLMs rarely fit this model. Even modest models tend to consume an entire GPU, and larger ones require sharding across GPUs or even nodes. Triton doesn’t prevent this, but it also doesn’t meaningfully help.&lt;/li>
&lt;/ul>
&lt;picture>
&lt;source srcset="https://talperry.com/en/posts/genai/triton-inference-server/model-sharding-vs-packing_hu16157470664919221625.webp" type="image/webp">
&lt;source srcset="https://talperry.com/en/posts/genai/triton-inference-server/model-sharding-vs-packing_hu11555892844646172857.png" type="image/png">
&lt;img src="https://talperry.com/en/posts/genai/triton-inference-server/model-sharding-vs-packing_hu11555892844646172857.png" alt="Model sharding vs model packing" class="article-image" loading="lazy">
&lt;/picture>
&lt;ul>
&lt;li>&lt;strong>Request caching → Prefix caching&lt;/strong>
Triton’s built-in cache works by storing request–response pairs, which is very effective for deterministic workloads. Generative models instead benefit from caching intermediate state, such as KV caches keyed by shared prompt prefixes. This is a fundamentally different problem and one that LLM-native serving systems handle far more naturally.&lt;/li>
&lt;/ul>
&lt;p>In short, I’ve consistently found it dramatically simpler to deploy vLLM directly and immediately benefit from continuous batching, sharding, and prefix caching than to layer Triton on top and wrestle with configuration to achieve similar behavior.&lt;/p>
&lt;h2 id="protect-latency-with-server-side-timeouts">Protect Latency with Server-Side Timeouts&lt;/h2>
&lt;p>Dynamic batching is Triton’s killer feature. By buffering requests for a short, configurable window and executing them in batch, Triton improves hardware utilization and eliminates a large amount of client-side complexity.&lt;/p>
&lt;p>There is, however, an important footgun: by default, Triton will not evict queued requests.&lt;/p>
&lt;p>Under load, it is entirely possible for Triton to accumulate a backlog while clients time out and move on. If &lt;code>max_queue_delay_microseconds&lt;/code> is not configured, those abandoned requests can sit in the queue and eventually execute, consuming resources while newer requests wait their turn.&lt;/p>
&lt;p>The result is perverse but common:&lt;/p>
&lt;ul>
&lt;li>Triton spends time processing requests the client has already given up on.&lt;/li>
&lt;li>Latency increases as the queue drains stale work.&lt;/li>
&lt;/ul>
&lt;p>This problem is especially acute when using the Python backend. While some native backends can detect client cancellation, the Python backend largely leaves this responsibility to user code. Once a request reaches your &lt;code>execute()&lt;/code> method, it will usually run to completion unless you explicitly check for cancellation.&lt;/p>
&lt;p>If you care about latency—and you almost certainly do—server-side queue timeouts are not optional.&lt;/p>
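&lt;p>As a sketch, a &lt;code>config.pbtxt&lt;/code> along these lines bounds queueing and evicts stale work (field names follow Triton&amp;rsquo;s model-configuration schema; the timeout values here are illustrative, not recommendations):&lt;/p>

```protobuf
dynamic_batching {
  # How long the batcher may wait to fill a batch.
  max_queue_delay_microseconds: 100
  default_queue_policy {
    # Reject requests that have waited past the deadline
    # instead of executing work the client has abandoned.
    timeout_action: REJECT
    default_timeout_microseconds: 200000
  }
}
```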
&lt;h2 id="keep-client-libraries-minimal">Keep Client Libraries Minimal&lt;/h2>
&lt;p>Triton requires clients to know model names, tensor names, shapes, and data types. Exposing this directly to application developers is unpleasant, so providing a small client wrapper is usually worth it.&lt;/p>
&lt;p>Where things go wrong is when that wrapper grows ambitions.&lt;/p>
&lt;p>I’ve seen (and built) client libraries that try to be helpful by adding retries, backoff, or other resilience features. In practice, this often backfires. Retrying requests that failed due to overload or invalid inputs can amplify traffic precisely when the system is already struggling, turning a transient slowdown into a self-inflicted denial-of-service.&lt;/p>
&lt;p>This is not to say don&amp;rsquo;t use retries, but rather don&amp;rsquo;t make them invisible: let callers see when retries happen, and make it easy to identify when retry logic needs to be revisited.&lt;/p>
&lt;p>My recommendation is simple: keep client libraries boring. Let them handle request construction and nothing more. Implement retries and error handling at the call site, where the application has the necessary context and observability to do the right thing.&lt;/p>
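&lt;p>A minimal sketch of what &amp;ldquo;boring&amp;rdquo; looks like. The model and tensor names are hypothetical, and the transport (e.g. a &lt;code>tritonclient&lt;/code> instance) is injected rather than imported, so the wrapper does request construction and nothing else:&lt;/p>

```python
class EncoderClient:
    """Hides Triton's tensor plumbing; adds no retries and no backoff."""

    def __init__(self, triton_client, model_name="text_encoder"):
        # triton_client: any object exposing infer(model_name, inputs);
        # in production this would wrap a real Triton client.
        self._client = triton_client
        self._model = model_name

    def encode(self, array):
        # Request construction only: map a friendly call onto the
        # model's tensor names, then hand the response straight back.
        response = self._client.infer(self._model, {"INPUT__0": array})
        return response["OUTPUT__0"]
```

&lt;p>Errors and retries stay at the call site, where the application has the context to decide whether retrying is safe.&lt;/p>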
&lt;h2 id="leverage-tritons-built-in-cache">Leverage Triton’s Built-in Cache&lt;/h2>
&lt;p>Triton’s request–response cache is easy to overlook, but it can be surprisingly effective, especially in cloud environments. GPU instances often come with far more system memory than is otherwise used, and allocating a few extra gigabytes to caching can spare your GPU a significant amount of redundant work.&lt;/p>
&lt;p>This is not a blanket recommendation—many workloads won’t benefit—but it is worth experimenting. Watching cache hit rates alongside queue depth can quickly tell you whether caching is helping and whether a particular client is generating unnecessary duplicate traffic.&lt;/p>
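&lt;p>In recent Triton versions, enabling the cache looks roughly like this (the cache size is illustrative; check your version&amp;rsquo;s flags before relying on them):&lt;/p>

```
# Start the server with a local response cache (size in bytes).
tritonserver --model-repository=/models --cache-config=local,size=4294967296

# Then opt each model in via its config.pbtxt:
response_cache { enable: true }
```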
&lt;h2 id="prefer-threadpoolexecutor-for-client-side-parallelism">Prefer ThreadPoolExecutor for Client-Side Parallelism&lt;/h2>
&lt;p>On the client side, I’ve found that the simplest way to issue parallel inference requests is also the best one: use a thread pool.&lt;/p>
&lt;p>In CPython, socket I/O releases the GIL. Since Triton’s HTTP client is primarily I/O-bound, this makes &lt;code>ThreadPoolExecutor&lt;/code> an effective and straightforward choice:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">infer&lt;/span>(inputs):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> model_client&lt;span style="color:#f92672">.&lt;/span>infer(inputs&lt;span style="color:#f92672">=&lt;/span>inputs)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">with&lt;/span> ThreadPoolExecutor(max_workers&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">8&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> pool:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> results &lt;span style="color:#f92672">=&lt;/span> list(pool&lt;span style="color:#f92672">.&lt;/span>map(infer, batch_of_requests))
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This approach has a few nice properties:&lt;/p>
&lt;ol>
&lt;li>The client does not need to implement batching logic.&lt;/li>
&lt;li>Triton’s dynamic batcher can aggregate requests across threads and even across clients.&lt;/li>
&lt;li>Concurrency is naturally bounded, providing a form of backpressure.&lt;/li>
&lt;/ol>
&lt;p>Any Python work inside &lt;code>infer&lt;/code> remains serialized, which turns out to be a feature rather than a bug: it prevents the client from overwhelming the server while still allowing efficient parallel I/O.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Triton is a powerful serving system, but it is also opinionated. It works best when its abstractions line up with the workload you are trying to serve.&lt;/p>
&lt;p>For classical inference workloads, Triton’s batching, scheduling, and caching are hard to beat. For LLMs and other generative models, purpose-built systems like vLLM tend to be a better fit. Understanding this distinction—and configuring Triton defensively when you do use it—goes a long way toward building reliable, low-latency inference systems.&lt;/p></description><author/><guid>https://talperry.com/en/posts/genai/triton-inference-server/</guid><pubDate>Mon, 15 Dec 2025 10:00:00 +0200</pubDate></item><item><title>I’m Not the Founder This App Deserves</title><link>https://talperry.com/en/posts/scripture-app/</link><description>&lt;p>Before diving into the reasons behind my decision, it&amp;rsquo;s essential to know that I am a Jewish Israeli atheist living in Berlin. This background might make you wonder why I would even consider building such an app.&lt;/p>
&lt;p>Despite my core identity, I sold a &lt;a href="https://lighttag.io">developer tools company&lt;/a> two years ago and vowed, &amp;ldquo;Never again to build a developer tools company.&amp;rdquo; Instead, I want to pursue something with a well-defined target market and clear value proposition, ideally requiring no outside capital.&lt;/p>
&lt;p>Memorizing Christian scripture, a niche within the Faithtech market, initially appeared promising. However, I ultimately concluded that it wasn’t the right fit for me. Here I’ll reflect on how I found the idea to begin with and how I concluded I am not a fit for it.&lt;/p>
&lt;h2 id="initial-motivation">Initial Motivation&lt;/h2>
&lt;p>As an immigrant in Germany, learning the local language has been a persistent challenge. I have been using &lt;a href="https://apps.ankiweb.net/">Anki&lt;/a>, a tool that employs “spaced repetition,” to expand my German vocabulary.&lt;/p>
&lt;p>Discovering Anki&amp;rsquo;s effectiveness, I came across a heartwarming &lt;a href="https://www.reddit.com/r/Anki/comments/eisra4/update_on_my_daughter_and_anki/">Reddit story&lt;/a> of a parent teaching their child to read using this tool. Inspired, I successfully taught my own five-year-old son to read with Anki.
&lt;img
src="./giraffe.jpeg"
alt="GenAI makes a fun way to read the word Giraffe "
loading="lazy"
decoding="async"
class="full-width"
/>
Moreover, I discovered that GenAI allows me to generate large volumes of high-quality content affordably, which would have been prohibitively expensive a few years ago. Using GenAI, I created engaging educational content for my son, like spelling the word “Wurst” (sausage) using images of sausages and producing illustrated and narrated German sentences in YouTube videos.&lt;/p>
&lt;p>There is something beautiful about people wanting to internalize the words that shape them. I was intrigued by the possibility of selling this as a product to other parents. However, I realized that the market for educational apps teaching children to read is unappealing. The price point is low, customer acquisition costs are high, regulations are complex, and subscription revenue is challenging to achieve.&lt;/p>
&lt;p>Despite these hurdles, I remained interested in the intersection of affordable high-quality content that GenAI enables and memorization algorithms. However, having previously made the mistake of building something and then validating whether someone wanted it, I now sought a problem to solve before developing a solution.&lt;/p>
&lt;p>One day, driven by curiosity, I embarked on an endeavor to memorize several chapters of the Old Testament in Hebrew using Anki. Although this could be a product, its Jewish-specific focus limits the potential market due to the smaller global Jewish population.&lt;/p>
&lt;p>In contrast, there are many Christians in the U.S. with smartphones and relatively high disposable income. This could be a viable market, so I began exploring scripture memorization for Christians.&lt;/p>
&lt;p>I also had to admit that while the market was large and legible, it wasn’t my story to tell, nor an audience I could intuit from lived experience.&lt;/p>
&lt;h2 id="the-deep-dive">The Deep Dive&lt;/h2>
&lt;p>A few quick Google searches revealed that there are about 200 million Christians in the U.S., with 140 million identifying as evangelicals. While I didn&amp;rsquo;t fully grasp the significance of this, I knew from social media that evangelicals are devout and willing to invest in their spirituality.&lt;/p>
&lt;p>This idea became more appealing when I discovered the wealth of data available about the prospective market. In contrast to my experience with developer tools, where market segmentation was a challenge, here I found detailed &lt;a href="https://www.pewresearch.org/religion/2023/06/02/use-of-apps-and-websites-in-religious-life/">Pew Research data&lt;/a> on app usage among different denominations, disposable income, and geographic distribution.&lt;/p>
&lt;p>With this data, I could effectively target specific market segments, tailoring language, imagery, and marketing strategies accordingly. I became convinced that if people were willing to pay for this solution, I could design effective marketing experiments to scale a sales machine.&lt;/p>
&lt;h2 id="the-product-development-hurdle">The Product Development Hurdle&lt;/h2>
&lt;p>While scalable marketing is promising, a marketing campaign needs a functioning product to bring to market. What does &amp;ldquo;functioning&amp;rdquo; mean in this context? For users, it means the app helps them memorize scripture.&lt;/p>
&lt;p>However for me, the person who will be investing time and money into building this, a functioning app means an app that converts users into paying customers and retains them.&lt;/p>
&lt;p>Viewing a product as a revenue-generating machine complicates the scope of an MVP. It involves appropriate microcopy, correct pricing, delivering a quick “Wow!” moment, and ensuring user retention.&lt;/p>
&lt;p>While feasible, it sounds challenging, expensive, and time-consuming. I asked myself a few questions: Could I achieve this without venture capital funding? Probably not. Do I have expertise in creating consumer apps that convert? No. Do I have insights into making the app viral? No.&lt;/p>
&lt;p>My enthusiasm waned, and a realization during a conversation with my wife sealed the decision.&lt;/p>
&lt;h2 id="the-marketing-challenge">The Marketing Challenge&lt;/h2>
&lt;p>While discussing the idea with my wife in traffic, Maria was singing along to &lt;a href="https://genius.com/Carrie-underwood-before-he-cheats-lyrics">Carrie Underwood’s “Before He Cheats”&lt;/a>, where she captures a whole universe with:&lt;/p>
&lt;blockquote>
&lt;p>“Right now, he&amp;rsquo;s probably buying her some fruity little drink &amp;lsquo;Cause she can&amp;rsquo;t shoot a whiskey,”&lt;/p>
&lt;/blockquote>
&lt;picture>
&lt;source srcset="https://talperry.com/en/posts/scripture-app/before-he-cheats_hu17757550046027709510.webp" type="image/webp">
&lt;img src="https://talperry.com/en/posts/scripture-app/before-he-cheats_hu17757550046027709510.webp" alt="Carrie Underwood — Before He Cheats" class="article-image" loading="lazy">
&lt;/picture>
&lt;p>It highlighted the songwriter&amp;rsquo;s deep understanding of their audience. Shooting whiskey is an evocative phrase for that audience, but relatively meaningless to me (an Israeli in Berlin, where whiskey isn&amp;rsquo;t a cultural staple). The songwriters knew their audience so well they could intuit evocative phrases like that.&lt;/p>
&lt;p>If I were to sell scripture memorization software to American Christians, what could I intuit about them? What relevance or advantage do I have in creating a product that touches on an identity I don’t share?&lt;/p>
&lt;p>I realized I wasn’t just missing the marketing language—I was missing the lived context that shapes why scripture memorization matters in the first place.&lt;/p>
&lt;p>This issue can be addressed with money. I could hire an agency that specializes in the Christian segment. But without a market-ready product, why invest in marketing? And without a clear marketing strategy, why build the product?&lt;/p>
&lt;h2 id="personal-fit-and-market-understanding">Personal Fit and Market Understanding&lt;/h2>
&lt;p>Both product and marketing challenges can be solved with time and money. But I had to ask myself, how much time? How much of my life am I willing to dedicate to building and selling scripture memorization software?&lt;/p>
&lt;p>Yes, I would like to help people deepen their spirituality. Yes, it would be intellectually stimulating. Yes, it could be lucrative. But I have no personal connection to the product or community. Is this how I want to spend the next 5-10 years of my life?&lt;/p>
&lt;p>No, it’s not.&lt;/p>
&lt;p>The question wasn’t whether I could build it—but why I would. There are many good problems, but not all of them are mine.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>I was initially excited about this opportunity because it leveraged familiar technology, had a large and well-defined market, and seemed potentially lucrative. However, I realized that without a personal advantage in this space, the cost (in time and money) to develop even an MVP was more than I was willing to invest.&lt;/p>
&lt;p>Exploration of a market became exploration of identity.&lt;/p></description><author/><guid>https://talperry.com/en/posts/scripture-app/</guid><pubDate>Tue, 14 May 2024 10:07:22 +0200</pubDate></item><item><title>Convolutional Methods for Text</title><link>https://talperry.com/en/posts/classics/cmft/</link><description>&lt;h3 id="tldr">tl;dr&lt;/h3>
&lt;ul>
&lt;li>RNNs work great for text, but convolutions can do it faster&lt;/li>
&lt;li>Any part of a sentence can influence the semantics of a word. For that reason we want our network to see the entire input at once&lt;/li>
&lt;li>Getting that big a receptive field can make gradients vanish and our networks fail&lt;/li>
&lt;li>We can solve the vanishing gradient problem with DenseNets or Dilated Convolutions&lt;/li>
&lt;li>Sometimes we need to generate text. We can use “deconvolutions” to generate arbitrarily long outputs.&lt;/li>
&lt;/ul>
&lt;h3 id="intro">Intro&lt;/h3>
&lt;p>Over the last three years, the field of NLP has gone through a huge revolution thanks to deep learning. The leader of this revolution has been the recurrent neural network and particularly its manifestation as an LSTM. Concurrently the field of computer vision has been reshaped by convolutional neural networks. This post explores what we “text people” can learn from our friends who are doing vision.&lt;/p>
&lt;h3 id="common-nlp-tasks">Common NLP Tasks&lt;/h3>
&lt;p>To set the stage and agree on a vocabulary, I’d like to introduce a few of the more common tasks in NLP. For the sake of consistency, I’ll assume that all of our model’s inputs are characters and that our “unit of observation” is a sentence. Both of these assumptions are just for the sake of convenience and you can replace characters with words and sentences with documents if you so wish.&lt;/p>
&lt;h4 id="classification">Classification&lt;/h4>
&lt;p>Perhaps the oldest trick in the book, we often want to classify a sentence. For example, we might want to classify an email subject as indicative of spam, guess the sentiment of a product review or assign a topic to a document.&lt;/p>
&lt;p>The straightforward way to handle this kind of task with an RNN is to feed the entire sentence into it, character by character, and then observe the RNN&amp;rsquo;s final hidden state.&lt;/p>
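&lt;p>As an illustrative sketch (untrained, toy-sized weights), that scheme looks like this: one set of parameters is applied at every character, and the final hidden state feeds a classifier:&lt;/p>

```python
import numpy as np

# A toy vanilla-RNN classifier over characters (illustrative sizes).
# One weight "box" (Wxh, Whh, b) is reused at every step; the final
# hidden state summarizes the whole sentence and feeds a classifier.
rng = np.random.default_rng(0)
vocab, hidden, classes = 128, 16, 2
Wxh = rng.normal(scale=0.1, size=(hidden, vocab))
Whh = rng.normal(scale=0.1, size=(hidden, hidden))
b = np.zeros(hidden)
Why = rng.normal(scale=0.1, size=(classes, hidden))

def classify(sentence):
    h = np.zeros(hidden)
    for ch in sentence:                      # feed characters one by one
        x = np.zeros(vocab)
        x[ord(ch) % vocab] = 1.0             # one-hot character encoding
        h = np.tanh(Wxh @ x + Whh @ h + b)   # same parameters every step
    logits = Why @ h                         # read off the final state
    e = np.exp(logits - logits.max())
    return e / e.sum()                       # class probabilities
```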
&lt;h4 id="sequence-labeling">Sequence Labeling&lt;/h4>
&lt;p>Sequence labeling tasks are tasks that return an output for each input. Examples include part-of-speech labeling and entity recognition. While the bare-bones LSTM model is far from the state of the art, it is easy to implement and offers compelling results. See &lt;a href="https://arxiv.org/pdf/1508.01991.pdf">this paper&lt;/a> for a more fleshed-out architecture.&lt;/p>
&lt;p>&lt;img
src="./bilstm-ner-sequence-labeling.webp"
alt="Bidirectional LSTM sequence labeling architecture"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;h4 id="sequence-generation">Sequence Generation&lt;/h4>
&lt;p>Arguably the most impressive results in recent NLP have been in translation. Translation is a mapping of one sequence to another, with no guarantees on the length of the output sentence. For example, translating the first word of the Bible from Hebrew to English gives בראשית = &amp;ldquo;In the Beginning&amp;rdquo;.&lt;/p>
&lt;p>At the core of this success is the Sequence-to-Sequence (a.k.a. encoder-decoder) framework, a methodology to &amp;ldquo;compress&amp;rdquo; a sequence into a code and then decode it into another sequence. Notable examples include translation (encode Hebrew, decode to English) and image captioning (encode an image, decode a textual description of its contents).&lt;/p>
&lt;p>&lt;img
src="./cnn-attention-image-captioning.webp"
alt="Image captioning pipeline with CNN features feeding an attention LSTM decoder"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>The basic Encoder step is similar to the scheme we described for classification. What’s amazing is that we can build a decoder that learns to generate arbitrary length outputs.&lt;/p>
&lt;p>The two examples above are really both translation, but sequence generation is a bit broader than that. OpenAI recently &lt;a href="https://blog.openai.com/unsupervised-sentiment-neuron/">published a paper&lt;/a> where they learn to generate “Amazon Reviews” while controlling the sentiment of the output.&lt;/p>
&lt;p>&lt;img
src="./sentiment-controlled-examples.webp"
alt="Generated Amazon reviews with sentiment constrained to positive or negative"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>Another personal favorite is the paper &lt;a href="https://arxiv.org/pdf/1511.06349.pdf">Generating Sentences from a Continuous Space&lt;/a>. In that paper, they trained a variational autoencoder on text, which led to the ability to interpolate between two sentences and get coherent results.&lt;/p>
&lt;p>&lt;img
src="./sentence-interpolation-samples.webp"
alt="Sentence interpolation samples from a variational autoencoder"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;h3 id="requirements-from-an-nlp-architecture">Requirements from an NLP architecture&lt;/h3>
&lt;p>What all of the implementations we looked at have in common is that they use a recurrent architecture, usually an LSTM (if you&amp;rsquo;re not sure what that is, &lt;a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">here&lt;/a> is a great intro). It is worth noting that none of the tasks had &amp;ldquo;recurrent&amp;rdquo; in their name, and none mentioned LSTMs. With that in mind, let&amp;rsquo;s take a moment to think about what RNNs, and particularly LSTMs, provide that makes them so ubiquitous in NLP.&lt;/p>
&lt;h4 id="arbitrary-input-size">Arbitrary Input Size&lt;/h4>
&lt;p>A standard feed forward neural network has a parameter for every input. This becomes problematic when dealing with text or images for a few reasons.&lt;/p>
&lt;ol>
&lt;li>It restricts the input size we can handle. Our network will have a finite number of input nodes and won’t be able to grow to more.&lt;/li>
&lt;li>We lose a lot of common information. Consider the sentences “I like to drink beer a lot” and “I like to drink a lot of beer”. A feed forward network would have to learn about the concept of “a lot” twice as it appears in different input nodes each time.&lt;/li>
&lt;/ol>
&lt;p>Recurrent neural networks solve this problem. Instead of having a node for each input, we have a big “box” of nodes that we apply to the input again and again. The “box” learns a sort of transition function, which means that the outputs follow some recurrence relation, hence the name.&lt;/p>
&lt;p>Remember that the &lt;em>vision people&lt;/em> got a lot of the same effect for images using convolutions. That is, instead of having an input node for each pixel, convolutions allowed the reuse of the same, small set of parameters across the entire image.&lt;/p>
&lt;h4 id="long-term-dependencies">Long Term Dependencies&lt;/h4>
&lt;p>The promise of RNNs is their ability to implicitly model long term dependencies. The picture below is taken from OpenAI. They trained a model that ended up recognizing sentiment and colored the text, character by character, with the model’s output. Notice how the model sees the word “best” and triggers a positive sentiment which it carries on for over 100 characters. That’s capturing a long range dependency.&lt;/p>
&lt;p>&lt;img
src="./sentiment-heatmap.webp"
alt="Sentiment neuron activation heatmap over review text"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>The theory of RNNs promises us long range dependencies out of the box. The practice is a little more difficult. When we learn via backpropagation, we need to propagate the signal through the entire recurrence relation. The thing is, at every step we end up multiplying by a number. If those numbers are generally smaller than 1, our signal will quickly go to 0. If they are larger than 1, then our signal will explode.&lt;/p>
&lt;p>These issues are called the vanishing and exploding gradient problems and are generally resolved by LSTMs and a few clever tricks. I mention them now because we’ll encounter these problems again with convolutions and will need another way to address them.&lt;/p>
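&lt;p>A toy calculation makes the problem concrete; the factors 0.9 and 1.1 below are arbitrary stand-ins for the per-step multipliers:&lt;/p>

```python
# The gradient signal is multiplied by a factor at every step it flows
# back through. Over 100 steps, factors just below 1 make it vanish and
# factors just above 1 make it explode.
steps = 100
shrunk = 0.9 ** steps
grown = 1.1 ** steps

print(shrunk)  # about 2.7e-05: effectively no signal left to learn from
print(grown)   # about 13780.6: the signal blows up instead
```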
&lt;h3 id="advantages-of-convolutions">Advantages of convolutions&lt;/h3>
&lt;p>So far we’ve seen how great LSTMs are, but this post is about convolutions. In the spirit of &lt;em>don’t fix what ain’t broken&lt;/em>, we have to ask ourselves why we’d want to use convolutions at all.&lt;/p>
&lt;p>One answer is “because we can”.&lt;/p>
&lt;p>But there are two other compelling reasons to use convolutions: speed and context.&lt;/p>
&lt;h4 id="parrelalisation">Parrelalisation&lt;/h4>
&lt;p>RNNs operate sequentially: the output for the second input depends on the output for the first, so we can’t parallelise an RNN. Convolutions have no such problem; each “patch” a convolutional kernel operates on is independent of the others, meaning we can go over the entire input layer concurrently.&lt;/p>
&lt;p>There is a price to pay for this: as we’ll see, we have to stack convolutions into deep layers in order to view the entire input, and those layers are calculated sequentially. But the calculations within each layer happen concurrently, and each individual computation is small (compared to an LSTM), so in practice we get a big speed up.&lt;/p>
&lt;p>When I set out to write this I only had my own experience and Google’s ByteNet to back this claim up. Just this week, Facebook published their fully convolutional translation model and reported a 9X speed up over LSTM based models.&lt;/p>
&lt;h4 id="view-the-whole-input-at-once">View the whole input at once&lt;/h4>
&lt;p>LSTMs read their input from left to right (or right to left), but sometimes we’d like the context at the end of the sentence to influence the network’s interpretation of its beginning. For example, we might have a sentence like “I’d love to buy your product. Not!” and we’d like that negation at the end to influence the entire sentence.&lt;/p>
&lt;p>With LSTMs we achieve this by running two LSTMs, one left to right and the other right to left and concatenating their outputs. This works well in practice but doubles our computational load.&lt;/p>
&lt;p>Convolutions, on the other hand, grow a larger “receptive field” as we stack more and more layers. That means that by default, each “step” in the convolution’s representation views all of the input in its receptive field, from before and after it. I’m not aware of any definitive argument that this is inherently better than an LSTM, but it does give us the desired effect in a controllable fashion and with a low computational cost.&lt;/p>
&lt;p>So far we’ve set up our problem domain and talked a bit about the conceptual advantages of convolutions for NLP. From here out, I’d like to translate those concepts into practical methods that we can use to analyze and construct our networks.&lt;/p>
&lt;h3 id="practical-convolutions-for-text">Practical convolutions for text&lt;/h3>
&lt;p>&lt;img
src="./convolution-animation.webp"
alt="Animated visualization of a convolutional kernel sliding over an image"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>You’ve probably seen an animation like the one above illustrating what a convolution does. The bottom is an input image, the top is the result and the gray shadow is the convolutional kernel which is repeatedly applied.&lt;/p>
&lt;p>This all makes perfect sense except that the input described in the picture is an image, with two spatial dimensions (height and width). We’re talking about text, which has only one dimension, and it’s temporal not spatial.&lt;/p>
&lt;p>For all practical purposes, that doesn’t matter. We just need to think of our text as an image of width &lt;em>n&lt;/em> and height 1. Tensorflow provides a conv1d function that does that for us, but it does not expose other convolutional operations in their 1d version.&lt;/p>
&lt;p>To make the “Text = an image of height 1” idea concrete, let’s see how we’d use the 2d convolutional op in Tensorflow on a sequence of embedded tokens.&lt;/p>
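&lt;p>To make the shapes concrete, here is a NumPy sketch (with assumed sizes) of what the tf.expand_dims, conv2d, squeeze sequence does:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 2 sentences of 7 tokens, 16-dimensional embeddings,
# 32 filters of width 3.
batch, width, embed_dim, num_filters, filter_width = 2, 7, 16, 32, 3

tokens = rng.normal(size=(batch, width, embed_dim))
kernel = rng.normal(size=(1, filter_width, embed_dim, num_filters))

# Give the text a height dimension of 1 (what tf.expand_dims does).
image = tokens[:, np.newaxis, :, :]             # (batch, 1, width, embed_dim)

# A "VALID" 2d convolution, written as an explicit loop over positions.
out_width = width - filter_width + 1
conv = np.empty((batch, 1, out_width, num_filters))
for i in range(out_width):
    patch = image[:, :, i:i + filter_width, :]  # (batch, 1, 3, embed_dim)
    conv[:, 0, i, :] = np.tensordot(patch, kernel, axes=([1, 2, 3], [0, 1, 2]))

# Squeeze away the height dimension again (what tf.squeeze does).
result = conv.squeeze(axis=1)                   # (batch, out_width, num_filters)
print(result.shape)  # (2, 5, 32)
```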
&lt;p>What we’re doing here is changing the shape of the input with tf.expand_dims so that it becomes an “image of height 1”. After running the 2d convolution operator, we squeeze away the extra dimension.&lt;/p>
&lt;h3 id="hierarchy-and-receptive-fields">Hierarchy and Receptive Fields&lt;/h3>
&lt;p>&lt;img
src="./cnn-receptive-hierarchy.webp"
alt="Hierarchy of CNN filters progressing from edges to faces"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>Many of us have seen pictures like the one above. It roughly shows the hierarchy of abstractions a CNN learns on images. In the first layer, the network learns basic edges. In the next layer, it combines those edges to learn more abstract concepts like eyes and noses. Finally, it combines those to recognize individual faces.&lt;/p>
&lt;p>With that in mind, we need to remember that each layer doesn’t just learn more abstract combinations of the previous layer. Successive layers, implicitly or explicitly, also see more of the input.&lt;/p>
&lt;p>&lt;img
src="./hierarchical-receptive-field-tree.webp"
alt="Receptive field tree showing layered aggregation across inputs"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;h4 id="increasing-receptive-field">Increasing Receptive Field&lt;/h4>
&lt;p>With vision, we’ll often want the network to identify one or more objects in the picture while ignoring others. That is, we’ll be interested in some local phenomenon but not in a relationship that spans the entire input.&lt;/p>
&lt;p>&lt;img
src="./hotdog-classifier-example.webp"
alt="Hotdog classifier app comparing hotdog and shoe photos"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>Text is more subtle, as we’ll often want intermediate representations of our data to carry as much context about their surroundings as they possibly can. In other words, we want as large a receptive field as possible. There are a few ways to go about this.&lt;/p>
&lt;h4 id="larger-filters">Larger Filters&lt;/h4>
&lt;p>The first, most obvious, way is to increase the filter size, that is, doing a [1x5] convolution instead of a [1x3]. In my work with text, I’ve not had great results with this, and I’ll offer my speculations as to why.&lt;/p>
&lt;p>In my domain, I mostly deal with character level inputs and with texts that are morphologically very rich. I think of (at least the first) layers of convolution as learning n-grams, so that the width of the filter corresponds to bigrams, trigrams etc. Having the network learn larger n-grams early exposes it to fewer examples, as there are more occurrences of “ab” in a text than “abb”.&lt;/p>
&lt;p>I’ve never proved this interpretation but have gotten consistently poorer results with filter widths larger than 3.&lt;/p>
&lt;h4 id="adding-layers">Adding Layers&lt;/h4>
&lt;p>As we saw in the picture above, adding more layers will increase the receptive field. &lt;a href="https://medium.com/u/b04dc6044cc">Dang Ha The Hien&lt;/a> wrote a &lt;a href="https://medium.com/@nikasa1889/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807">great guide&lt;/a> to calculating the receptive field at each layer which I encourage you to read.&lt;/p>
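&lt;p>As a rough sketch of the arithmetic from that guide, for a stack of identical stride-1 layers the receptive field grows by (kernel size minus 1) input positions per layer:&lt;/p>

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field after stacking identical stride-1 convolutions."""
    rf = 1  # a single input position
    for _ in range(num_layers):
        rf += kernel_size - 1
    return rf

# Linear growth: two extra input positions per layer for width-3 filters.
print([receptive_field(n) for n in (1, 2, 3, 10)])  # [3, 5, 7, 21]
```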
&lt;p>Adding layers has two distinct but related effects. The one that gets thrown around a lot is that the model will learn to make higher level abstractions over the inputs that it gets (pixels =&amp;gt; edges =&amp;gt; eyes =&amp;gt; face). The other is that the receptive field grows with each layer.&lt;/p>
&lt;p>This means that given enough depth, our network could look at the entire input layer, though perhaps through a haze of abstractions. Unfortunately, this is where the vanishing gradient problem may rear its ugly head.&lt;/p>
&lt;h4 id="the-gradient--receptive-field-trade-off">The Gradient / Receptive field trade off&lt;/h4>
&lt;p>Neural networks are networks that information flows through. In the forward pass our input flows and transforms, hopefully becoming a representation that is more amenable to our task. During the backward pass we propagate a signal, the gradient, back through the network. Just like in vanilla RNNs, that signal gets multiplied frequently, and if it goes through a series of numbers that are smaller than 1 it will fade to 0. That means that our network will end up with very little signal to learn from.&lt;/p>
&lt;p>This leaves us with something of a tradeoff. On the one hand, we’d like to be able to take in as much context as possible. On the other hand, if we try to increase our receptive fields by stacking layers we risk vanishing gradients and a failure to learn anything.&lt;/p>
&lt;h3 id="two-solutions-to-the-vanishing-gradient-problem">Two Solutions to the Vanishing Gradient Problem&lt;/h3>
&lt;p>Luckily, many smart people have been thinking about these problems. Luckier still, these aren’t problems that are unique to text, the &lt;em>vision people&lt;/em> also want larger receptive fields and information rich gradients. Let’s take a look at some of their crazy ideas and use them to further our own textual glory.&lt;/p>
&lt;h4 id="residual-connections">Residual Connections&lt;/h4>
&lt;p>2016 was another great year for the &lt;em>vision people&lt;/em>, with at least two very popular architectures emerging: &lt;a href="https://arxiv.org/abs/1512.03385">ResNets&lt;/a> and &lt;a href="https://arxiv.org/abs/1608.06993">DenseNets&lt;/a> (the DenseNet paper, in particular, is exceptionally well written and well worth the read). Both of them address the same problem: how do I make my network very deep without losing the gradient signal?&lt;/p>
&lt;p>&lt;a href="https://medium.com/u/18dfe63fa7f0">Arthur Juliani&lt;/a> wrote a fantastic overview of &lt;a href="https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32">Resnet, DenseNets and Highway networks&lt;/a> for those of you looking for the details and comparison. I’ll briefly touch on DenseNets which take the core concept to its extreme.&lt;/p>
&lt;p>&lt;img
src="./densenet-connections.webp"
alt="DenseNet block with densely connected convolutional layers"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>The general idea is to reduce the distance between the signal coming from the network’s loss and each individual layer. The way this is done is by adding a residual/direct connection between every layer and its predecessors. That way, the gradient can flow from each layer to its predecessors directly.&lt;/p>
&lt;p>DenseNets do this in a particularly interesting way. They concatenate the output of each layer to its input such that:&lt;/p>
&lt;ol>
&lt;li>We start with an embedding of our inputs, say of dimension 10.&lt;/li>
&lt;li>Our first layer calculates 10 feature maps. It outputs the 10 feature maps concatenated to the original embedding.&lt;/li>
&lt;li>The second layer gets as input 20 dimensional vectors (10 from the input and 10 from the previous layer) and calculates another 10 feature maps. Thus it outputs 30 dimensional vectors.&lt;/li>
&lt;/ol>
&lt;p>And so on for as many layers as you’d like. The paper describes a boatload of tricks to make things manageable and efficient, but that’s the basic premise, and the vanishing gradient problem is solved.&lt;/p>
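&lt;p>The concatenation scheme is easy to sketch in NumPy; the tanh projection below is an arbitrary stand-in for a real convolutional layer:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

growth = 10   # feature maps each layer adds, as in the example above
seq_len = 20  # an assumed sequence length

def dense_layer(x):
    # Stand-in for a real convolutional layer: any map from x's channels
    # to `growth` new feature maps.
    w = rng.normal(size=(x.shape[-1], growth))
    new_features = np.tanh(x @ w)
    # The DenseNet move: concatenate the new features onto the input.
    return np.concatenate([x, new_features], axis=-1)

x = rng.normal(size=(seq_len, 10))  # the 10-dimensional embedding
for _ in range(2):
    x = dense_layer(x)

print(x.shape)  # (20, 30): 10 embedding dims plus 10 + 10 feature maps
```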
&lt;p>There are two other things I’d like to point out.&lt;/p>
&lt;ol>
&lt;li>I previously mentioned that upper layers have a view of the original input that may be hazed by layers of abstraction. One of the highlights of concatenating the outputs of each layer is that the original signal reaches the following layers intact, so that all layers have a direct view of lower level features, essentially removing some of the haze.&lt;/li>
&lt;li>The residual connection trick requires that all of our layers have the same shape. That means we need to pad each layer so that its input and output have the same spatial dimensions [1 x width]. On its own, then, this kind of architecture will work for sequence labeling tasks (where the input and the output have the same spatial dimensions) but will need more work for encoding and classification tasks (where we need to reduce the input to a fixed size vector or set of vectors). The DenseNet paper actually handles this, as their goal is classification, and we’ll expand on this point later.&lt;/li>
&lt;/ol>
&lt;h4 id="dilated-convolutions">Dilated Convolutions&lt;/h4>
&lt;p>Dilated convolutions AKA &lt;em>atrous&lt;/em> convolutions AKA convolutions with holes are another method of increasing the receptive field without angering the gradient gods. When we looked at stacking layers so far, we saw that the receptive field grows linearly with depth. Dilated convolutions let us grow the receptive field exponentially with depth.&lt;/p>
&lt;p>You can find an almost accessible explanation of dilated convolutions in the paper &lt;a href="https://arxiv.org/pdf/1511.07122.pdf">Multi scale context aggregation by dilated convolutions&lt;/a> which uses them for vision. While conceptually simple, it took me a while to understand exactly what they do, and I may still have it not quite right.&lt;/p>
&lt;p>The basic idea is to introduce “holes” into each filter, so that it doesn’t operate on adjacent parts of the input but rather skips over them to parts further away. Note that this is different from applying a convolution with stride &amp;gt;1. When we stride a filter, we skip over parts of the input between applications of the convolution. With dilated convolutions, we skip over parts of the input within a single application of the convolution. By cleverly arranging growing dilations we can achieve the promised exponential growth in receptive fields.&lt;/p>
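&lt;p>A small calculation shows the promised exponential growth, assuming width-3 filters, stride 1, and dilations that double at each layer:&lt;/p>

```python
def dilated_receptive_field(dilations, kernel_size=3):
    """Receptive field of stacked stride-1 convolutions with given dilations."""
    rf = 1
    for d in dilations:
        # A width-k filter with dilation d spans (k - 1) * d extra positions.
        rf += (kernel_size - 1) * d
    return rf

# Doubling the dilation at each layer gives exponential growth with depth.
print([dilated_receptive_field([2 ** i for i in range(n)]) for n in (1, 2, 3, 4)])
# [3, 7, 15, 31]
```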
&lt;p>We’ve talked a lot of theory so far, but we’re finally at a point where we can see this stuff in action!&lt;/p>
&lt;p>A personal favorite paper is &lt;a href="https://arxiv.org/pdf/1610.10099.pdf">Neural Machine Translation in Linear Time&lt;/a>. It follows the encoder decoder structure we talked about in the beginning. We still don’t have all the tools to talk about the decoder, but we can see the encoder in action.&lt;/p>
&lt;p>&lt;img
src="./dilated-convolution-receptive-field.webp"
alt="Dilated convolution encoder with expanding receptive fields over a sequence"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>And here’s an English input&lt;/p>
&lt;blockquote>
&lt;p>Director Jon Favreau, who is currently working on Disney’s forthcoming Jungle Book film, told the website Hollywood Reporter: “I think times are changing.”&lt;/p>
&lt;/blockquote>
&lt;p>And its translation, brought to you by dilated convolutions&lt;/p>
&lt;blockquote>
&lt;p>Regisseur Jon Favreau, der zur Zeit an Disneys kommendem Jungle Book Film arbeitet, hat der Website Hollywood Reporter gesagt: “Ich denke, die Zeiten andern sich”.&lt;/p>
&lt;/blockquote>
&lt;p>And as a bonus, remember that sound is just like text, in the sense that it has just one spatial/temporal dimension. Check out DeepMind’s &lt;a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Wavenet&lt;/a> which uses dilated convolutions (and a lot of other magic) to generate &lt;a href="https://storage.googleapis.com/deepmind-media/pixie/knowing-what-to-say/second-list/speaker-1.wav">human sounding speech&lt;/a> and &lt;a href="https://storage.googleapis.com/deepmind-media/pixie/making-music/sample_4.wav">piano music&lt;/a>.&lt;/p>
&lt;h3 id="getting-stuff-out-of-your-network">Getting Stuff Out of your network&lt;/h3>
&lt;p>When we discussed DenseNets I mentioned that the use of residual connections forces us to keep the input and output length of our sequence the same, which is done via padding. This is great for tasks where we need to label each item in our sequence for example:&lt;/p>
&lt;ul>
&lt;li>In part-of-speech tagging, where each word gets a part-of-speech label.&lt;/li>
&lt;li>In entity recognition, where we might label words as Person or Company, with Other for everything else.&lt;/li>
&lt;/ul>
&lt;p>Other times we’ll want to reduce our input sequence down to a vector representation and use that to predict something about the entire sentence:&lt;/p>
&lt;ul>
&lt;li>We might want to label an email as spam based on its content and/or subject.&lt;/li>
&lt;li>We might want to predict whether a certain sentence is sarcastic or not.&lt;/li>
&lt;/ul>
&lt;p>In these cases, we can follow the traditional approaches of the &lt;em>vision people&lt;/em> and top off our network with convolutional layers that don’t have padding and/or use pooling operations.&lt;/p>
&lt;p>But sometimes we’ll want to follow the Seq2Seq paradigm, what &lt;a href="https://medium.com/u/42936aed59d2">Matthew Honnibal&lt;/a> succinctly called &lt;a href="https://explosion.ai/blog/deep-learning-formula-nlp">&lt;em>Embed, encode, attend, predict&lt;/em>&lt;/a>. In this case, we reduce our input down to some vector representation but then need to somehow upsample that vector back to a sequence of the proper length.&lt;/p>
&lt;p>This task entails two problems&lt;/p>
&lt;ul>
&lt;li>How do we do upsampling with convolutions?&lt;/li>
&lt;li>How do we do exactly the right amount of upsampling?&lt;/li>
&lt;/ul>
&lt;p>I still haven’t found the answer to the second question, or at least have not yet understood it. In practice, it’s been enough for me to assume some upper bound on the maximum length of the output and then upsample to that point. I suspect Facebook’s new &lt;a href="https://s3.amazonaws.com/fairseq/papers/convolutional-sequence-to-sequence-learning.pdf">translation paper&lt;/a> may address this, but I have not yet read it deeply enough to comment.&lt;/p>
&lt;h4 id="upsampling-with-deconvolutions">Upsampling with deconvolutions&lt;/h4>
&lt;p>Deconvolutions are our tool for upsampling. It’s easiest (for me) to understand what they do through visualizations. Luckily, a few smart folks published a &lt;a href="http://distill.pub/2016/deconv-checkerboard/">great post on deconvolutions&lt;/a> over at Distill and included some fun visualizers. Let’s start with those.&lt;/p>
&lt;p>&lt;img
src="./strided-convolution-diagram.webp"
alt="Strided convolution diagram showing kernel covering inputs"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>Consider the image on top. If we take the bottom layer as the input we have a standard convolution of stride 1 and width 3. &lt;em>But,&lt;/em> we can also go from top down, that is treat the top layer as the input and get the slightly larger bottom layer.&lt;/p>
&lt;p>If you stop to think about that for a second, this “top down” operation is already happening in your convolutional networks when you do back propagation, as the gradient signals need to propagate in exactly the way shown in the picture. Even better, it turns out that this operation is simply the transpose of the convolution operation, hence the other common (and technically correct) name for this operation, transposed convolution.&lt;/p>
&lt;p>Here’s where it gets fun. We can stride our convolutions to shrink our input. Thus we can stride our deconvolutions to grow our input. I think the easiest way to understand how strides work with deconvolutions is to look at the following pictures.&lt;/p>
&lt;p>&lt;img
src="./strided-convolution-diagram.webp"
alt="Strided convolution diagram showing kernel covering inputs"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;img
src="./transposed-convolution-overlap.webp"
alt="Transposed convolution with overlapping coverage of outputs"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>We’ve already seen the top one. Notice that each input (the top layer) feeds three of the outputs and that each of the outputs is fed by three inputs (except the edges).&lt;/p>
&lt;p>&lt;img
src="./dilated-convolution-spacing.webp"
alt="Dilated convolution spacing with gaps widening receptive field"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>In the second picture we place imaginary holes in our inputs. Notice that now each of the outputs is fed by at most two inputs.&lt;/p>
&lt;p>&lt;img
src="./transposed-convolution-upscaling.webp"
alt="Transposed convolution upscaling a sequence length"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>In the third picture we’ve added two imaginary holes into our input layer, and so each output is fed by exactly one input. This ends up tripling the sequence length of our output with respect to the sequence length of our input.&lt;/p>
&lt;p>Finally, we can stack multiple deconvolutional layers to gradually grow our output layer to the desired size.&lt;/p>
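&lt;p>A bare-bones NumPy sketch of a 1d transposed convolution makes the growth explicit; with stride 3 and a width-3 kernel, 4 inputs become 12 outputs, the tripling just described:&lt;/p>

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Each input element spreads the kernel across the output, `stride` apart."""
    k = len(kernel)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, value in enumerate(x):
        out[i * stride:i * stride + k] += value * kernel
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.ones(3)

# Stride 3 with a width-3 kernel: each output is fed by exactly one input.
upsampled = transposed_conv1d(x, kernel, stride=3)
print(len(upsampled))  # 12: three output positions for every input position
```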
&lt;p>A few things worth thinking about&lt;/p>
&lt;ol>
&lt;li>If you look at these drawings from bottom up, they end up being standard strided convolutions where we just added imaginary holes at the output layers (the white blocks).&lt;/li>
&lt;li>In practice, each “input” isn’t a single number but a vector. In the image world, it might be a 3 dimensional RGB value. In text it might be a 300 dimensional word embedding. If you’re (de)convolving in the middle of your network each point would be a vector of whatever size came out of the last layer.&lt;/li>
&lt;li>I point that out to convince you that there is enough information in the input layer of a deconvolution to spread across a few points in the output.&lt;/li>
&lt;li>In practice, I’ve had success running a few convolutions with length preserving padding after a deconvolution. I imagine, though haven’t proven, that this acts like a redistribution of information. I think of it like letting a steak rest after grilling to let the juices redistribute.&lt;/li>
&lt;/ol>
&lt;p>&lt;img
src="./steak-resting-comparison.webp"
alt="Comparison of steak not rested versus rested to illustrate information redistribution"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;h3 id="summary">Summary&lt;/h3>
&lt;p>The main reason you might want to consider convolutions in your work is because they are fast. I think that’s important to make research and exploration faster and more efficient. Faster networks shorten our feedback cycles.&lt;/p>
&lt;p>Most of the tasks I’ve encountered with text end up having the same requirement of the architecture: Maximize the receptive field while maintaining an adequate flow of gradients. We’ve seen the use of both DenseNets and dilated convolutions to achieve that.&lt;/p>
&lt;p>Finally, sometimes we want to expand a sequence or a vector into a larger sequence. We looked at deconvolutions as a way to do “upsampling” on text and, as a bonus, compared adding a convolution afterwards to letting a steak rest and redistribute its juices.&lt;/p>
&lt;p>I’d love to learn more about your thoughts and experiences with these kinds of models. Share in the comments or ping me on twitter &lt;a href="https://twitter.com/thetalperry">@thetalperry&lt;/a>&lt;/p></description><author/><guid>https://talperry.com/en/posts/classics/cmft/</guid><pubDate>Mon, 22 May 2017 00:00:00 +0000</pubDate></item><item><title>Deep Learning The Stock Market</title><link>https://talperry.com/en/posts/classics/dlsm/</link><description>&lt;p>&lt;em>&lt;strong>Update 15.03.2024&lt;/strong> I wrote this more than seven years ago. My understanding has evolved since then, and the world of deep learning has gone through more than one revolution since. It was popular back in the day and might still be a fun read, though you might find more accurate and up-to-date information elsewhere.&lt;/em>&lt;/p>
&lt;p>&lt;em>&lt;strong>Update 25.1.17&lt;/strong> — Took me a while but&lt;/em> &lt;a href="https://github.com/talolard/MarketVectors/blob/master/preparedata.ipynb">&lt;em>here is an ipython notebook&lt;/em>&lt;/a> &lt;em>with a rough implementation&lt;/em>&lt;/p>
&lt;p>&lt;img
src="./performance-plot-market-returns.webp"
alt="Cumulative return comparison for different trading signals"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;h2 id="why-nlp-is-relevant-to-stock-prediction">Why NLP is relevant to Stock prediction&lt;/h2>
&lt;p>In many NLP problems we end up taking a sequence and encoding it into a single fixed size representation, then decoding that representation into another sequence. For example, we might tag entities in the text, translate from English to French or convert audio frequencies to text. There is a torrent of work coming out in these areas and a lot of the results are achieving state of the art performance.&lt;/p>
&lt;p>In my mind, the biggest difference between NLP and financial analysis is that language comes with some guarantee of structure; it’s just that the rules of the structure are vague. Markets, on the other hand, don’t come with a promise of a learnable structure. That such a structure exists is the assumption that this project would prove or disprove (rather, it might prove or disprove whether I can find that structure).&lt;/p>
&lt;p>Assuming the structure is there, the idea of summarizing the current state of the market in the same way we encode the semantics of a paragraph seems plausible to me. If that doesn’t make sense yet, keep reading. It will.&lt;/p>
&lt;h2 id="you-shall-know-a-word-by-the-company-it-keeps-firth-j-r-195711">You shall know a word by the company it keeps (Firth, J. R. 1957:11)&lt;/h2>
&lt;p>There is tons of literature on word embeddings. &lt;a href="https://www.youtube.com/watch?v=xhHOL3TNyJs&amp;index=2&amp;list=PLmImxx8Char9Ig0ZHSyTqGsdhb9weEGam">Richard Socher’s lecture&lt;/a> is a great place to start. In short, we can make a geometry of all the words in our language, and that geometry captures the meaning of words and the relationships between them. You may have seen the example of “King - man + woman = Queen” or something of the sort.&lt;/p>
&lt;p>&lt;img
src="./shakespeare-code-sample.webp"
alt="Embedding geometry example highlighting nearest neighbors for the word frog"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>Embeddings are cool because they let us represent information in a condensed way. The old way of representing words was holding a vector (a big list of numbers) that was as long as the number of words we know, and setting a 1 in a particular place if that was the current word we are looking at. That is not an efficient approach, nor does it capture any meaning. With embeddings, we can represent all of the words in a fixed number of dimensions (300 seems to be plenty, 50 works great) and then leverage their higher dimensional geometry to understand them.&lt;/p>
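&lt;p>A tiny sketch of the contrast, with a made-up five-word vocabulary and random numbers standing in for a trained embedding table:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["king", "queen", "man", "woman", "frog"]

# The old way: a one-hot vector as long as the vocabulary, almost all zeros.
one_hot = np.eye(len(vocab))
print(one_hot[vocab.index("frog")])  # [0. 0. 0. 0. 1.]

# Embeddings: every word gets a dense vector of a fixed, small size
# (4 here; a real table would have a few hundred dimensions and be learned).
embeddings = rng.normal(size=(len(vocab), 4))
print(embeddings[vocab.index("frog")].shape)  # (4,)
```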
&lt;p>The picture below shows an example. An embedding was trained on more or less the entire internet. After a few days of intensive calculations, each word was embedded in some high dimensional space. This “space” has a geometry, with concepts like distance, and so we can ask which words are close together. The authors/inventors of that method give an example: here are the words that are closest to “frog”.&lt;/p>
&lt;p>&lt;img
src="./word2vec-neighbors-frog.webp"
alt="Nearest neighbors list for the word frog from a word2vec model"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>But we can embed more than just words. We can do, say, stock market embeddings.&lt;/p>
&lt;h2 id="market2vec">Market2Vec&lt;/h2>
&lt;p>The first word embedding algorithm I heard about was word2vec. I want to get the same effect for the market, though I’ll be using a different algorithm. My input data is a csv; the first column is the date, and there are 4*1000 columns corresponding to the High, Low, Open, and Closing prices of 1000 stocks. That is, my input vector is 4000-dimensional, which is too big. So the first thing I’m going to do is stuff it into a lower dimensional space, say 300, because I liked the movie.
&lt;img
src="./market-embedding-diagram.webp"
alt="Market2Vec embedding diagram compressing 4000 dimensional prices to 300"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>Taking something in 4000 dimensions and stuffing it into a 300-dimensional space may sound hard, but it’s actually easy. We just need to multiply matrices. A matrix is a big excel spreadsheet that has numbers in every cell and no formatting problems. Imagine an excel table with 4000 columns and 300 rows; when we bang it against our 4000-dimensional vector, a new vector comes out that is only of size 300. I wish that’s how they would have explained it in college.&lt;/p>
&lt;p>The fanciness starts here as we’re going to set the numbers in our matrix at random, and part of the “deep learning” is to update those numbers so that our excel spreadsheet changes. Eventually this matrix spreadsheet (I’ll stick with matrix from now on) will have numbers in it that bang our original 4000 dimensional vector into a concise 300 dimensional summary of itself.&lt;/p>
&lt;p>We’re going to get a little fancier here and apply what they call an activation function. We’re going to take a function and apply it to each number in the vector individually, so that they all end up between 0 and 1 (or 0 and infinity, it depends). Why? It makes our vector more special and makes our learning process able to understand more complicated things. &lt;a href="https://lmgtfy.com/?q=why+does+deep+learning+use+non+linearities">How&lt;/a>?&lt;/p>
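&lt;p>Putting the last two steps together in NumPy, with random numbers standing in for both the data and the yet-to-be-learned matrix:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# One day of input: high/low/open/close for 1000 stocks = 4000 numbers.
market_day = rng.normal(size=4000)

# The "excel spreadsheet": a randomly initialised matrix that training
# would later adjust, banging 4000 dimensions down to 300.
projection = rng.normal(size=(300, 4000)) * 0.01
embedded = projection @ market_day

# The activation function (a sigmoid here), squashing every number
# to between 0 and 1.
activated = 1.0 / (1.0 + np.exp(-embedded))

print(embedded.shape)                                  # (300,)
print(activated.min() >= 0.0, activated.max() <= 1.0)  # True True
```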
&lt;p>So what? What I’m expecting to find is that the new embedding of the market prices (the vector) into a smaller space captures all the essential information for the task at hand, without wasting time on the other stuff. I’d expect it to capture correlations between stocks, perhaps noticing when a certain sector is declining or when the market is very hot. I don’t know what traits it will find, but I assume they’ll be useful.&lt;/p>
&lt;h2 id="now-what">Now What&lt;/h2>
&lt;p>Lets put aside our market vectors for a moment and talk about language models. &lt;a href="https://medium.com/u/ac9d9a35533e">Andrej Karpathy&lt;/a> wrote the epic post “&lt;a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">The Unreasonable effectiveness of Recurrent Neural Networks&lt;/a>”. If I’d summarize in the most liberal fashion the post boils down to&lt;/p>
&lt;ol>
&lt;li>If we look at the works of Shakespeare and go over them character by character, we can use “deep learning” to learn a language model.&lt;/li>
&lt;li>A language model (in this case) &lt;strong>is a magic box&lt;/strong>. You put in the first few characters and it tells you what the next one will be.&lt;/li>
&lt;li>If we take the character that the language model predicted and feed it back in we can keep going forever.&lt;/li>
&lt;/ol>
&lt;p>And then as a punchline, he generated a bunch of text that looks like Shakespeare. And then he did it again with the Linux source code. And then again with a textbook on Algebraic geometry.&lt;/p>
&lt;p>So I’ll get back to the mechanics of that magic box in a second, but let me remind you that we want to predict the future market based on the past, just like Karpathy predicted the next character based on the previous ones. Where he used characters, we’re going to use our market vectors and feed them into the magic black box. We haven’t decided what we want it to predict yet, but that is okay; we won’t be feeding its output back into it either.&lt;/p>
&lt;h2 id="going-deeper">Going deeper&lt;/h2>
&lt;p>I want to point out that this is where we start to get into the deep part of deep learning. So far we just have a single layer of learning, that excel spreadsheet that condenses the market. Now we’re going to add a few more layers and stack them, to make a “deep” something. That’s the deep in deep learning.&lt;/p>
&lt;p>So Karpathy shows us some sample output from the Linux source code, this is stuff his black box wrote.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-cpp" data-lang="cpp">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">static&lt;/span> &lt;span style="color:#66d9ef">void&lt;/span> &lt;span style="color:#a6e22e">action_new_function&lt;/span>(&lt;span style="color:#66d9ef">struct&lt;/span> &lt;span style="color:#a6e22e">s_stat_info&lt;/span> &lt;span style="color:#f92672">*&lt;/span>wb)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">unsigned&lt;/span> &lt;span style="color:#66d9ef">long&lt;/span> flags;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">int&lt;/span> lel_idx_bit &lt;span style="color:#f92672">=&lt;/span> e&lt;span style="color:#f92672">-&amp;gt;&lt;/span>edd, &lt;span style="color:#f92672">*&lt;/span>sys &lt;span style="color:#f92672">&amp;amp;&lt;/span> &lt;span style="color:#f92672">~&lt;/span>((&lt;span style="color:#66d9ef">unsigned&lt;/span> &lt;span style="color:#66d9ef">long&lt;/span>) &lt;span style="color:#f92672">*&lt;/span>FIRST_COMPAT);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> buf[&lt;span style="color:#ae81ff">0&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0xFFFFFFFF&lt;/span> &lt;span style="color:#f92672">&amp;amp;&lt;/span> (bit &lt;span style="color:#f92672">&amp;lt;&amp;lt;&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> min(inc, slist&lt;span style="color:#f92672">-&amp;gt;&lt;/span>bytes);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> printk(KERN_WARNING &lt;span style="color:#e6db74">&amp;#34;Memory allocated %02x/%02x, &amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;original MLL instead&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> min(min(multi_run &lt;span style="color:#f92672">-&lt;/span> s&lt;span style="color:#f92672">-&amp;gt;&lt;/span>len, max) &lt;span style="color:#f92672">*&lt;/span> num_data_in),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> frame_pos, sz &lt;span style="color:#f92672">+&lt;/span> first_seg);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> div_u64_w(val, inb_p);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> spin_unlock(&lt;span style="color:#f92672">&amp;amp;&lt;/span>disk&lt;span style="color:#f92672">-&amp;gt;&lt;/span>queue_lock);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> mutex_unlock(&lt;span style="color:#f92672">&amp;amp;&lt;/span>s&lt;span style="color:#f92672">-&amp;gt;&lt;/span>sock&lt;span style="color:#f92672">-&amp;gt;&lt;/span>mutex);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> mutex_unlock(&lt;span style="color:#f92672">&amp;amp;&lt;/span>func&lt;span style="color:#f92672">-&amp;gt;&lt;/span>mutex);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> disassemble(info&lt;span style="color:#f92672">-&amp;gt;&lt;/span>pending_bh);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Notice that it knows how to open and close parentheses and respects indentation conventions: the contents of the function are properly indented, and the multi-line &lt;em>printk&lt;/em> statement has an inner indentation. That means this magic box understands long-range dependencies. When it’s indenting within the print statement, it knows it’s in a print statement and also remembers that it’s in a function (or at least in another indented scope). &lt;strong>That’s nuts.&lt;/strong> It’s easy to gloss over, but an algorithm that can capture and remember long-term dependencies is super useful because… we want to find long-term dependencies in the market.&lt;/p>
&lt;h2 id="inside-the-magical-black-box">Inside the magical black box&lt;/h2>
&lt;p>What’s inside this magical black box? It is a type of Recurrent Neural Network (RNN) called an LSTM. An RNN is a deep learning algorithm that operates on sequences (like sequences of characters). At every step, it takes a representation of the next character (like the embeddings we talked about before) and operates on that representation with a matrix, as we saw before. The thing is, the RNN has some form of internal memory, so it remembers what it saw previously. It uses that memory to decide how exactly it should operate on the next input. Using that memory, the RNN can “remember” that it is inside of an indented scope, and that is how we get properly nested output text.&lt;/p>
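&lt;p>A minimal sketch of that recurrence, a vanilla RNN step in NumPy: the sizes are made up and the weights are random and untrained, but it shows how the new memory mixes the current input with the previous memory:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hidden_dim = 300, 128  # sizes are illustrative

# The "excel spreadsheets": random here, learned in practice
W_x = rng.normal(0, 0.01, (hidden_dim, emb_dim))
W_h = rng.normal(0, 0.01, (hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_step(x, h):
    # New memory depends on the current input AND the previous memory
    return np.tanh(W_x @ x + W_h @ h + b)

h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, emb_dim)):  # a toy sequence of 5 embeddings
    h = rnn_step(x, h)
print(h.shape)  # (128,)
```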
&lt;p>&lt;img
src="./nested-scope-code-structure.webp"
alt="LSTM unfolded through time showing how hidden state carries indentation context"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>A fancy version of an RNN is called a Long Short Term Memory network (LSTM). An LSTM has cleverly designed memory that allows it to:&lt;/p>
&lt;ol>
&lt;li>Selectively choose what it remembers&lt;/li>
&lt;li>Decide to forget&lt;/li>
&lt;li>Select how much of its memory it should output.&lt;/li>
&lt;/ol>
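&lt;p>Here’s a toy NumPy sketch of a single LSTM step, with those three abilities spelled out as “gates”. The sizes and random weights are purely illustrative; a real implementation would also carry bias terms and learn everything by training:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 8, 16  # toy sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on [input, previous hidden] concatenated
W = {g: rng.normal(0, 0.1, (n_hid, n_in + n_hid)) for g in "ifog"}

def lstm_step(x, h, c):
    z = np.concatenate([x, h])
    i = sigmoid(W["i"] @ z)   # 1. what to write into memory
    f = sigmoid(W["f"] @ z)   # 2. what to forget
    o = sigmoid(W["o"] @ z)   # 3. how much memory to output
    g = np.tanh(W["g"] @ z)   # candidate memory content
    c = f * c + i * g         # update the memory cell
    h = o * np.tanh(c)        # emit a gated view of the memory
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c)
print(h.shape, c.shape)  # (16,) (16,)
```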
&lt;p>&lt;img
src="./lstm-memory-gates.webp"
alt="Diagram of LSTM gates controlling memory input output and forget operations"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>So an LSTM can see a “{” and say to itself “Oh yeah, that’s important, I should remember that”, and when it does, it essentially remembers an indication that it is in a nested scope. Once it sees the corresponding “}” it can decide to forget the original opening brace and thus forget that it is in a nested scope.&lt;/p>
&lt;p>We can have the LSTM learn more abstract concepts by stacking a few of them on top of each other, which makes us “Deep” again. Now each output of the previous LSTM becomes the input of the next LSTM, and each one goes on to learn higher abstractions of the data coming in. In the example above (and this is just illustrative speculation), the first layer of LSTMs might learn that characters separated by a space are “words”. The next layer might learn word types (&lt;code>static void action_new_function&lt;/code>). The next layer might learn the concept of a function and its arguments, and so on. It’s hard to tell exactly what each layer is doing, though Karpathy’s blog has a really nice example of how he visualized exactly that.&lt;/p>
&lt;h2 id="connecting-market2vec-and-lstms">Connecting Market2Vec and LSTMs&lt;/h2>
&lt;p>The studious reader will notice that Karpathy used characters as his inputs, not embeddings (technically, a one-hot encoding of characters). But Lars Eidnes actually used word embeddings when he wrote &lt;a href="https://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/">Auto-Generating Clickbait With Recurrent Neural Networks&lt;/a>.&lt;/p>
&lt;p>&lt;img
src="./stacked-lstm-architecture.webp"
alt="Stacked LSTM architecture consuming word vectors and passing outputs upward"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;p>The figure above shows the network he used. Ignore the SoftMax part (we’ll get to it later). For the moment, notice how he feeds a sequence of word vectors in at the bottom. (Remember, a “word vector” is a representation of a word in the form of a bunch of numbers, like we saw at the beginning of this post.) Each one of those word vectors:&lt;/p>
&lt;ol>
&lt;li>Influences the first LSTM&lt;/li>
&lt;li>Makes its LSTM output something to the LSTM above it&lt;/li>
&lt;li>Makes its LSTM output something to the LSTM for the next word&lt;/li>
&lt;/ol>
&lt;p>We’re going to do the same thing with one difference, instead of word vectors we’ll input “MarketVectors”, those market vectors we described before. To recap, the MarketVectors should contain a summary of what’s happening in the market at a given point in time. By putting a sequence of them through LSTMs I hope to capture the long term dynamics that have been happening in the market. By stacking together a few layers of LSTMs I hope to capture higher level abstractions of the market’s behavior.&lt;/p>
&lt;h2 id="what-comes-out">What Comes out&lt;/h2>
&lt;p>&lt;em>Thus far we haven’t talked at all about how the algorithm actually learns anything; we’ve just talked about all the clever transformations we’ll do on the data. We’ll defer that conversation to a few paragraphs down, but please keep this part in mind, as it is the set-up for the punch line that makes everything else worthwhile.&lt;/em>&lt;/p>
&lt;p>In Karpathy’s example, the output of the LSTMs is a vector that represents the next character in some abstract representation. In Eidnes’ example, the output of the LSTMs is a vector that represents what the next word will be in some abstract space. The next step in both cases is to change that abstract representation into a probability vector, that is, a list that says how likely each character or word, respectively, is to appear next. That’s the job of the SoftMax function. Once we have a list of likelihoods, we select the character or word that is the most likely to appear next.&lt;/p>
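&lt;p>For the curious, here is a minimal SoftMax in plain NumPy (not tied to any framework): it turns a vector of arbitrary scores into a list of probabilities that sum to 1:&lt;/p>

```python
import numpy as np

def softmax(v):
    # Exponentiate (subtracting the max first for numerical stability)
    # and normalise so the entries form a probability distribution
    e = np.exp(v - v.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # abstract LSTM output for 3 candidates
probs = softmax(scores)
print(probs)           # probabilities summing to 1
print(probs.argmax())  # index of the most likely candidate
```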
&lt;p>In our case of “predicting the market”, we need to ask ourselves what exactly we want the model to predict. Some of the options I thought about were:&lt;/p>
&lt;ol>
&lt;li>Predict the next price for each of the 1000 stocks&lt;/li>
&lt;li>Predict the value of some index (S&amp;amp;P, VIX etc) in the next &lt;em>n&lt;/em> minutes.&lt;/li>
&lt;li>Predict which of the stocks will move up by more than &lt;em>x%&lt;/em> in the next &lt;em>n&lt;/em> minutes&lt;/li>
&lt;li>(My personal favorite) Predict which stocks will go up/down by &lt;em>2x%&lt;/em> in the next &lt;em>n&lt;/em> minutes while not going &lt;em>down/up&lt;/em> by more than &lt;em>x%&lt;/em> in that time.&lt;/li>
&lt;li>(The one we’ll follow for the remainder of this article). Predict when the VIX will go up/down by &lt;em>2x%&lt;/em> in the next &lt;em>n&lt;/em> minutes while not going &lt;em>down/up&lt;/em> by more than &lt;em>x%&lt;/em> in that time.&lt;/li>
&lt;/ol>
&lt;p>1 and 2 are regression problems, where we have to predict an actual number instead of the likelihood of a specific event (like the letter n appearing or the market going up). Those are fine but not what I want to do.&lt;/p>
&lt;p>3 and 4 are fairly similar, they both ask to predict an event (In technical jargon — a class label). An event could be the letter &lt;em>n&lt;/em> appearing next or it could be &lt;em>Moved up 5% while not going down more than 3% in the last 10 minutes.&lt;/em> The trade-off between 3 and 4 is that 3 is much more common and thus easier to learn about, while 4 is more valuable, as not only is it an indicator of profit, it also has some constraint on risk.&lt;/p>
&lt;p>5 is the one we’ll continue with for this article because it’s similar to 3 and 4 but has mechanics that are easier to follow. The &lt;a href="https://en.wikipedia.org/wiki/VIX">VIX&lt;/a> is sometimes called the Fear Index and it represents how volatile the stocks in the S&amp;amp;P500 are. It is derived by observing the &lt;a href="https://en.wikipedia.org/wiki/Implied_volatility">implied volatility&lt;/a> for specific options on each of the stocks in the index.&lt;/p>
&lt;h3 id="sidenote--why-predict-the-vix">Sidenote — Why predict the VIX&lt;/h3>
&lt;p>What makes the VIX an interesting target is that&lt;/p>
&lt;ol>
&lt;li>It is only one number as opposed to 1000s of stocks. This makes it conceptually easier to follow and reduces computational costs.&lt;/li>
&lt;li>It is the summary of many stocks so most if not all of our inputs are relevant&lt;/li>
&lt;li>It is not a linear combination of our inputs. Implied volatility is extracted from a complicated, non-linear formula stock by stock. The VIX is derived from a complex formula on top of that. If we can predict that, it’s pretty cool.&lt;/li>
&lt;li>It’s tradeable so if this actually works we can use it.&lt;/li>
&lt;/ol>
&lt;h2 id="back-to-our-lstm-outputs-and-the-softmax">Back to our LSTM outputs and the SoftMax&lt;/h2>
&lt;p>How do we use the formulations we saw before to predict changes in the VIX a few minutes into the future? For each point in our dataset, we’ll look at what happened to the VIX 5 minutes later. If it went up by more than 1% without going down more than 0.5% during that time, we’ll output a 1; otherwise a 0. Then we’ll get a sequence that looks like:&lt;/p>
&lt;blockquote>
&lt;p>0,0,0,0,0,1,1,0,0,0,1,1,0,0,0,0,1,1,1,0,0,0,0,0 ….&lt;/p>
&lt;/blockquote>
&lt;p>We want to take the vector that our LSTMs output and squish it so that it gives us the probability of the next item in our sequence being a 1. The squishing happens in the SoftMax part of the diagram above. (Technically, since we only have one class now, we use a sigmoid.)&lt;/p>
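&lt;p>Generating that sequence of 0s and 1s can be sketched like this. The thresholds match the text, but the toy price series is made up for illustration:&lt;/p>

```python
def label_vix(prices, horizon=5, up=0.01, down=0.005):
    # For each timepoint: 1 if the series rises by `up` (1%) within
    # `horizon` steps WITHOUT first falling by `down` (0.5%), else 0.
    labels = []
    for t in range(len(prices) - horizon):
        base = prices[t]
        label = 0
        for p in prices[t + 1 : t + 1 + horizon]:
            if p <= base * (1 - down):
                break          # fell too far first -> stays 0
            if p >= base * (1 + up):
                label = 1      # hit the target without the drawdown
                break
        labels.append(label)
    return labels

toy_vix = [100, 100.2, 101.5, 101.0, 99.0, 99.2, 100.1, 100.0, 99.4, 98.0]
print(label_vix(toy_vix))  # [1, 1, 0, 0, 1]
```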
&lt;p>So before we get into how this thing learns, let’s recap what we’ve done so far&lt;/p>
&lt;ol>
&lt;li>We take as input a sequence of price data for 1000 stocks&lt;/li>
&lt;li>Each timepoint in the sequence is a snapshot of the market. Our input is a list of 4000 numbers. We use an embedding layer to represent the key information in just 300 numbers.&lt;/li>
&lt;li>Now we have a sequence of embeddings of the market. We put those into a stack of LSTMs, timestep by timestep. The LSTMs remember things from the previous steps and that influences how they process the current one.&lt;/li>
&lt;li>We pass the output of the first layer of LSTMs into another layer. These guys also remember and they learn higher level abstractions of the information we put in.&lt;/li>
&lt;li>Finally, we take the output from all of the LSTMs and “squish them” so that our sequence of market information turns into a sequence of probabilities. The probability in question is “How likely is the VIX to go up 1% in the next 5 minutes without going down 0.5%”&lt;/li>
&lt;/ol>
&lt;h2 id="how-does-this-thing-learn">How does this thing learn?&lt;/h2>
&lt;p>Now the fun part. Everything we did until now is called the forward pass; we do all of those steps while we train the algorithm and also when we use it in production. Here we’ll talk about the backward pass, the part we do only during training, which makes our algorithm learn.&lt;/p>
&lt;p>So during training, not only did we prepare years’ worth of historical data, we also prepared a sequence of prediction targets, that list of 0s and 1s that showed whether the VIX moved the way we wanted it to after each observation in our data.&lt;/p>
&lt;p>To learn, we’ll feed the market data to our network and compare its output to what we calculated. Comparing in our case will be simple subtraction, that is we’ll say that our model’s error is&lt;/p>
&lt;blockquote>
&lt;p>ERROR = √( (precomputed label − predicted probability)² )&lt;/p>
&lt;/blockquote>
&lt;p>Or in English, the square root of the square of the difference between what actually happened and what we predicted.&lt;/p>
&lt;p>Here’s the beauty. That’s a differentiable function, that is, we can tell how much the error would have changed if our prediction had changed a little. Our prediction is the outcome of a differentiable function, the SoftMax. The inputs to the SoftMax, the LSTMs, are all mathematical functions that are themselves differentiable. Now, all of these functions are full of parameters, those big excel spreadsheets I talked about ages ago. So at this stage we take the derivative of the error with respect to every one of the millions of parameters in all of those excel spreadsheets in our model. When we do that, we can see how the error changes when we change each parameter, so we change each parameter in a way that reduces the error.&lt;/p>
&lt;p>This procedure propagates all the way to the beginning of the model. It tweaks the way we embed the inputs into MarketVectors so that our MarketVectors represent the most significant information for our task.&lt;/p>
&lt;p>It tweaks when and what each LSTM chooses to remember so that their outputs are the most relevant to our task.&lt;/p>
&lt;p>It tweaks the abstractions our LSTMs learn so that they learn the most important abstractions for our task.&lt;/p>
&lt;p>Which in my opinion is amazing because we have all of this complexity and abstraction that we never had to specify anywhere. It’s all inferred MathaMagically from the specification of what we consider to be an error.&lt;/p>
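&lt;p>To make the backward pass concrete, here is a toy example with a single parameter: one weight feeding a sigmoid, nudged downhill along the derivative of the squared error. The numbers are arbitrary; a real model does exactly this, just across millions of parameters at once:&lt;/p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One toy "spreadsheet cell": a single weight mapping a one-number input
# to a probability, trained to match a 0/1 target by gradient descent.
x, target, w, lr = 2.0, 1.0, -1.0, 0.5
for _ in range(200):
    pred = sigmoid(w * x)
    error = (target - pred) ** 2                        # squared error, as in the text
    grad = 2 * (pred - target) * pred * (1 - pred) * x  # d(error)/dw by the chain rule
    w -= lr * grad                                      # nudge w to reduce the error
print(round(sigmoid(w * x), 3))  # prediction has moved close to 1.0
```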
&lt;p>&lt;img
src="./stochastic-gradient-plot.webp"
alt="Training loss curve illustrating stochastic gradient descent behavior"
loading="lazy"
decoding="async"
class="full-width"
/>
&lt;/p>
&lt;h2 id="whats-next">What’s next&lt;/h2>
&lt;p>Now that I’ve laid this out in writing and it still makes sense to me, I want:&lt;/p>
&lt;ol>
&lt;li>To see if anyone bothers reading this.&lt;/li>
&lt;li>To fix all of the mistakes my dear readers point out&lt;/li>
&lt;li>To consider whether this is still feasible&lt;/li>
&lt;li>To build it&lt;/li>
&lt;/ol>
&lt;p>So, if you’ve come this far please point out my errors and share your inputs.&lt;/p>
&lt;h2 id="other-thoughts">Other thoughts&lt;/h2>
&lt;p>Here are some more advanced thoughts about this project, what other things I might try, and why it makes sense to me that this may actually work.&lt;/p>
&lt;h3 id="liquidity-and-efficient-use-of-capital">Liquidity and efficient use of capital&lt;/h3>
&lt;p>Generally, the more liquid a particular market is, the more efficient it is. I think this is due to a chicken-and-egg cycle: as a market becomes more liquid, it is able to absorb more capital moving in and out without that capital hurting itself. And as a market becomes more liquid and more capital can be used in it, you’ll find more sophisticated players moving in. This is because it is expensive to be sophisticated, so you need to make returns on a large chunk of capital in order to justify your operational costs.&lt;/p>
&lt;p>A quick corollary is that in less liquid markets the competition isn’t quite as sophisticated, so the opportunities a system like this can find may not have been traded away. The point being, were I to try to trade this, I would try to trade it on less liquid segments of the market, that is, maybe the TASE 100 instead of the S&amp;amp;P 500.&lt;/p>
&lt;h3 id="this-stuff-is-new">This stuff is new&lt;/h3>
&lt;p>The knowledge of these algorithms, the frameworks to execute them, and the computing power to train them are all new, at least in the sense that they are available to the average Joe such as myself. I’d assume that top players figured this stuff out years ago and have had the capacity to execute for just as long, but, as I mention in the paragraph above, they are likely executing in liquid markets that can support their size. The next tier of market participants, I assume, has a slower velocity of technological assimilation, and in that sense there is, or soon will be, a race to execute on this in as yet untapped markets.&lt;/p>
&lt;h3 id="multiple-time-frames">Multiple Time Frames&lt;/h3>
&lt;p>While I mentioned a single stream of inputs in the above, I imagine that a more efficient way to train would be to train market vectors (at least) on multiple time frames and feed them in at the inference stage. That is, my lowest time frame would be sampled every 30 seconds and I’d expect the network to learn dependencies that stretch hours at most.&lt;/p>
&lt;p>I don’t know if they are relevant or not but I think there are patterns on multiple time frames and if the cost of computation can be brought low enough then it is worthwhile to incorporate them into the model. I’m still wrestling with how best to represent these on the computational graph and perhaps it is not mandatory to start with.&lt;/p>
&lt;h3 id="marketvectors">MarketVectors&lt;/h3>
&lt;p>When using word vectors in NLP, we usually start with a pretrained model and continue adjusting the embeddings during the training of our model. In my case, there are no pretrained market vectors available, nor is there a clear algorithm for training them.&lt;/p>
&lt;p>My original consideration was to use an auto-encoder like in &lt;a href="http://cs229.stanford.edu/proj2013/TakeuchiLee-ApplyingDeepLearningToEnhanceMomentumTradingStrategiesInStocks.pdf">this paper&lt;/a> but end to end training is cooler.&lt;/p>
&lt;p>A more serious consideration is the success of sequence to sequence models in translation and speech recognition, where a sequence is eventually encoded as a single vector and then decoded into a different representation (Like from speech to text or from English to French). In that view, the entire architecture I described is essentially the encoder and I haven’t really laid out a decoder.&lt;/p>
&lt;p>But, I want to achieve something specific with the first layer, the one that takes as input the 4000 dimensional vector and outputs a 300 dimensional one. I want it to find correlations or relations between various stocks and compose features about them.&lt;/p>
&lt;p>The alternative is to run each input through an LSTM, perhaps concatenate all of the output vectors, and consider that the output of the encoder stage. I think this would be inefficient, as the interactions and correlations between instruments and their features would be lost, and there would be 10x more computation required. On the other hand, such an architecture could naively be parallelized across multiple GPUs and hosts, which is an advantage.&lt;/p>
&lt;h3 id="cnns">CNNs&lt;/h3>
&lt;p>Recently there has been a spate of papers on character-level machine translation. This &lt;a href="https://arxiv.org/pdf/1610.03017v2.pdf">paper&lt;/a> caught my eye, as they manage to capture long range dependencies with a convolutional layer rather than an RNN. I haven’t given it more than a brief read, but I think that a modification where I’d treat each stock as a channel and convolve over channels first (like in RGB images) would be another way to capture the market dynamics, in the same way that they essentially encode semantic meaning from characters.&lt;/p></description><author/><guid>https://talperry.com/en/posts/classics/dlsm/</guid><pubDate>Sat, 03 Dec 2016 00:00:00 +0000</pubDate></item></channel></rss>