Tal Perry

Five Practical Lessons for Serving Models with Triton Inference Server

Mon, 15 Dec 2025 10:00:00 +0200

Triton Inference Server has become a popular choice for production model serving, and for good reason: it is fast, flexible, and powerful. That said, using Triton effectively requires understanding where it shines—and where it very much does not. This post collects five practical lessons from running Triton in production that I wish I had internalized earlier.

Choose the Right Serving Layer

Not all models belong on Triton. Use vLLM for generative models; use Triton for more traditional inference workloads.

LLMs are everywhere right now, and Triton offers integrations with both TensorRT-LLM and vLLM. At first glance, this makes Triton look like a one-stop shop for serving everything from image classifiers to large language models.

In practice, I’ve found that Triton adds very little on top of a “raw” vLLM deployment. That’s not a knock on Triton—it’s a reflection of how different generative workloads are from classical inference. Many of Triton’s best features simply don’t map cleanly to the way LLMs are served.

A few concrete examples make this clear:

Dynamic batching → Continuous batching Triton’s dynamic batcher waits briefly to group whole requests and then executes them together. This works extremely well for fixed-shape inference. LLM serving, on the other hand, benefits from continuous batching, where new requests are inserted into an active batch as others finish generating tokens. While this is technically possible through Triton’s vLLM backend, it is neither simple nor obvious to operate.

Model packing → Model sharding Triton makes it easy to pack multiple models onto a single GPU to improve utilization. LLMs rarely fit this model. Even modest models tend to consume an entire GPU, and larger ones require sharding across GPUs or even nodes. Triton doesn’t prevent this, but it also doesn’t meaningfully help.

Request caching → Prefix caching Triton’s built-in cache works by storing request–response pairs, which is very effective for deterministic workloads. Generative models instead benefit from caching intermediate state, such as KV caches keyed by shared prompt prefixes. This is a fundamentally different problem and one that LLM-native serving systems handle far more naturally.

In short, I’ve consistently found it dramatically simpler to deploy vLLM directly and immediately benefit from continuous batching, sharding, and prefix caching than to layer Triton on top and wrestle with configuration to achieve similar behavior.

Protect Latency with Server-Side Timeouts

Dynamic batching is Triton’s killer feature. By buffering requests for a short, configurable window and executing them in batch, Triton improves hardware utilization and eliminates a large amount of client-side complexity.

There is, however, an important footgun: by default, Triton will not evict queued requests.

Under load, it is entirely possible for Triton to accumulate a backlog while clients time out and move on. If max_queue_delay_microseconds is not configured, those abandoned requests can sit in the queue and eventually execute, consuming resources while newer requests wait their turn.

The result is perverse but common:

Triton spends time processing requests the client has already given up on.
Latency increases as the queue drains stale work.

This problem is especially acute when using the Python backend. While some native backends can detect client cancellation, the Python backend largely leaves this responsibility to user code. Once a request reaches your execute() method, it will usually run to completion unless you explicitly check for cancellation.

If you care about latency—and you almost certainly do—server-side queue timeouts are not optional.

Keep Client Libraries Minimal

Triton requires clients to know model names, tensor names, shapes, and data types. Exposing this directly to application developers is unpleasant, so providing a small client wrapper is usually worth it.

Where things go wrong is when that wrapper grows ambitions.

I’ve seen (and built) client libraries that try to be helpful by adding retries, backoff, or other resilience features. In practice, this often backfires. Retrying requests that failed due to overload or invalid inputs can amplify traffic precisely when the system is already struggling, turning a transient slowdown into a self-inflicted denial-of-service.

Which is not to say don’t use retries, but rather don’t make them invisible, and allow callers to identify and be identified when retry logic needs to be revistied.

My recommendation is simple: keep client libraries boring. Let them handle request construction and nothing more. Implement retries and error handling at the call site, where the application has the necessary context and observability to do the right thing.

Leverage Triton’s Built-in Cache

Triton’s request–response cache is easy to overlook, but it can be surprisingly effective, especially in cloud environments. GPU instances often come with far more system memory than is otherwise used, and allocating a few extra gigabytes to caching can spare your GPU a significant amount of redundant work.

This is not a blanket recommendation—many workloads won’t benefit—but it is worth experimenting. Watching cache hit rates alongside queue depth can quickly tell you whether caching is helping and whether a particular client is generating unnecessary duplicate traffic.

Prefer ThreadPoolExecutor for Client-Side Parallelism

On the client side, I’ve found that the simplest way to issue parallel inference requests is also the best one: use a thread pool.

In CPython, socket I/O releases the GIL. Since Triton’s HTTP client is primarily I/O-bound, this makes ThreadPoolExecutor an effective and straightforward choice:

def infer(inputs):
 return model_client.infer(inputs=inputs)

with ThreadPoolExecutor(max_workers=8) as pool:
 results = list(pool.map(infer, batch_of_requests))

This approach has a few nice properties:

The client does not need to implement batching logic.
Triton’s dynamic batcher can aggregate requests across threads and even across clients.
Concurrency is naturally bounded, providing a form of backpressure.

Any Python work inside infer remains serialized, which turns out to be a feature rather than a bug: it prevents the client from overwhelming the server while still allowing efficient parallel I/O.

Conclusion

Triton is a powerful serving system, but it is also opinionated. It works best when its abstractions line up with the workload you are trying to serve.

For classical inference workloads, Triton’s batching, scheduling, and caching are hard to beat. For LLMs and other generative models, purpose-built systems like vLLM tend to be a better fit. Understanding this distinction—and configuring Triton defensively when you do use it—goes a long way toward building reliable, low-latency inference systems.

I’m Not the Founder This App Deserves

Tue, 14 May 2024 10:07:22 +0200

Before diving into the reasons behind my decision, it’s essential to know that I am a Jewish Israeli atheist living in Berlin. This background might make you wonder why I would even consider building such an app.

Despite my core identity, I sold a developer tools company two years ago and vowed, “Never again to build a developer tools company.” Instead, I want to pursue something with a well-defined target market and clear value proposition, ideally requiring no outside capital.

Memorizing Christian scripture, a niche within the Faithtech market, initially appeared promising. However, I ultimately concluded that it wasn’t the right fit for me. Here I’ll reflect on how I found the idea to begin with and how I concluded I am not a fit for it.

Initial Motivation

As an immigrant in Germany, learning the local language has been a persistent challenge. I have been using Anki, a tool that employs “spaced repetition,” to expand my German vocabulary.

Discovering Anki’s effectiveness, I came across a heartwarming Reddit story of a parent teaching their child to read using this tool. Inspired, I successfully taught my own five-year-old son to read with Anki. Moreover, I discovered that GenAI allows me to generate large volumes of high-quality content affordably, which would have been prohibitively expensive a few years ago. Using GenAI, I created engaging educational content for my son, like spelling the word “Wurst” (sausage) using images of sausages and producing illustrated and narrated German sentences in YouTube videos.

There is something beautiful about people wanting to internalize the words that shape them. I was intrigued by the possibility of selling this as a product to other parents. However, I realized that the market for educational apps teaching children to read is unappealing. The price point is low, customer acquisition costs are high, regulations are complex, and subscription revenue is challenging to achieve.

Despite these hurdles, I remained interested in the intersection of affordable high-quality content that GenAI enables and memorization algorithms. However, having previously made the mistake of building something and then validating whether someone wanted it, I now sought a problem to solve before developing a solution.

One day, driven by curiosity, I embarked on an endeavor to memorize several chapters of the Old Testament in Hebrew using Anki. Although this could be a product, its Jewish-specific focus limits the potential market due to the smaller global Jewish population.

In contrast, there are many Christians in the U.S. with smartphones and relatively high disposable income. This could be a viable market, so I began exploring scripture memorization for Christians.

I also had to admit that while the market was large and legible, it wasn’t my story to tell, nor an audience I could intuit from lived experience.

The Deep Dive

A few quick Google searches revealed that there are about 200 million Christians in the U.S., with 140 million identifying as evangelicals. While I didn’t fully grasp the significance of this, I knew from social media that evangelicals are devout and willing to invest in their spirituality.

This idea became more appealing when I discovered the wealth of data available about the prospective market. In contrast to my experience with developer tools, where market segmentation was a challenge, here I found detailed Pew Research data on app usage among different denominations, disposable income, and geographic distribution.

With this data, I could effectively target specific market segments, tailoring language, imagery, and marketing strategies accordingly. I became convinced that if people were willing to pay for this solution, I could design effective marketing experiments to scale a sales machine.

The Product Development Hurdle

While scalable marketing is promising, a marketing campaign needs a functioning product to bring to market. What does “functioning’’ mean in this context? For users, it means the app helps them memorize scripture.

However for me, the person who will be investing time and money into building this, a functioning app means an app that converts users into paying customers and retains them.

Viewing a product as a revenue-generating machine complicates the scope of an MVP. It involves appropriate microcopy, correct pricing, delivering a quick “Wow!” moment, and ensuring user retention.

While feasible, it sounds challenging, expensive, and time-consuming. I asked myself a few questions: Could I achieve this without venture capital funding? Probably not. Do I have expertise in creating consumer apps that convert? No. Do I have insights into making the app viral? No.

My enthusiasm waned, and a realization during a conversation with my wife sealed the decision.

The Marketing Challenge

While discussing the idea with my wife in traffic, Maria was singing along to Carrie Underwood’s “Before He Cheats”, where she captures a whole universe with:

“Right now, he’s probably buying her some fruity little drink ‘Cause she can’t shoot a whiskey,”

It highlighted the songwriter’s deep understanding of their audience. Shooting whiskey is an evocative phrase for their audience, but relatively meaningless to me (An Israeli in Berlin where whiskey isn’t a cultural staple). The songwriters knew their audience so well they could intuit evocative phrases like that.

If I were to sell scripture memorization software to American Christians, what could I intuit about them? What relevance or advantage do I have in creating a product that touches on an identity I don’t share?

I realized I wasn’t just missing the marketing language—I was missing the lived context that shapes why scripture memorization matters in the first place.

This issue can be addressed with money. I could hire an agency that specializes in the Christian segment. But without a market-ready product, why invest in marketing? And without a clear marketing strategy, why build the product?

Personal Fit and Market Understanding

Both product and marketing challenges can be solved with time and money. But I had to ask myself, how much time? How much of my life am I willing to dedicate to building and selling scripture memorization software?

Yes, I would like to help people deepen their spirituality. Yes, it would be intellectually stimulating. Yes, it could be lucrative. But I have no personal connection to the product or community. Is this how I want to spend the next 5-10 years of my life?

No, it’s not.

The question wasn’t whether I could build it—but why I would. There are many good problems, but not all of them are mine.

Conclusion

I was initially excited about this opportunity because it leveraged familiar technology, had a large and well-defined market, and seemed potentially lucrative. However, I realized that without a personal advantage in this space, the cost (in time and money) to develop even an MVP was more than I was willing to invest.

Exploration of a market became exploration of identity.

Convolutional Methods for Text

Mon, 22 May 2017 00:00:00 +0000

tl;dr

RNNS work great for text but convolutions can do it faster
Any part of a sentence can influence the semantics of a word. For that reason we want our network to see the entire input at once
Getting that big a receptive can make gradients vanish and our networks fail
We can solve the vanishing gradient problem with DenseNets or Dilated Convolutions
Sometimes we need to generate text. We can use “deconvolutions” to generate arbitrarily long outputs.

Intro

Over the last three years, the field of NLP has gone through a huge revolution thanks to deep learning. The leader of this revolution has been the recurrent neural network and particularly its manifestation as an LSTM. Concurrently the field of computer vision has been reshaped by convolutional neural networks. This post explores what we “text people” can learn from our friends who are doing vision.

Common NLP Tasks

To set the stage and agree on a vocabulary, I’d like to introduce a few of the more common tasks in NLP. For the sake of consistency, I’ll assume that all of our model’s inputs are characters and that our “unit of observation” is a sentence. Both of these assumptions are just for the sake of convenience and you can replace characters with words and sentences with documents if you so wish.

Classification

Perhaps the oldest trick in the book, we often want to classify a sentence. For example, we might want to classify an email subject as indicative of spam, guess the sentiment of a product review or assign a topic to a document.

The straightforward way to handle this kind of task with an RNN is to feed entire sentence into it, character by character, and then observe the RNNs final hidden state.

Sequence Labeling

Sequence labeling tasks are tasks that return an output for each input. Examples include part of speech labeling or entity recognition tasks. While the bare bones LSTM model is far from the state of the art, it is easy to implement and offers compelling results. See this paper for a more fleshed out architecture

Sequence Generation

Arguably the most impressive results in recent NLP have been in translation. Translation is a mapping of one sequence to another, with no guarantees on the length of the output sentence. For example, translating the first words of the Bible from Hebrew to English is בראשית = “In the Beginning”.

At the core of this success is the Sequence to Sequence (AKA encoder decoder) framework, a methodology to “compress” a sequence into a code and then decode it to another sequence. Notable examples include translation (Encode Hebrew and decode to English), image captioning (Encode an Image and decode a textual description of its contents)

The basic Encoder step is similar to the scheme we described for classification. What’s amazing is that we can build a decoder that learns to generate arbitrary length outputs.

The two examples above are really both translation, but sequence generation is a bit broader than that. OpenAI recently published a paper where they learn to generate “Amazon Reviews” while controlling the sentiment of the output

Another personal favorite is the paper Generating Sentences from a Continuous Space. In that paper, they trained a variational autoencoder on text, which led to the ability to interpolate between two sentences and get coherent results.

Requirements from an NLP architecture

What all of the implementations we looked at have in common is that they use a recurrent architecture, usually an LSTM (If your not sure what that is, here is a great intro) . It is worth noting that none of the tasks had recurrent in their name, and none mentioned LSTMs. With that in mind, lets take a moment to think what RNNs and particularly LSTMs provide that make them so ubiquitous in NLP.

Arbitrary Input Size

A standard feed forward neural network has a parameter for every input. This becomes problematic when dealing with text or images for a few reasons.

It restricts the input size we can handle. Our network will have a finite number of input nodes and won’t be able to grow to more.
We lose a lot of common information. Consider the sentences “I like to drink beer a lot” and “I like to drink a lot of beer”. A feed forward network would have to learn about the concept of “a lot” twice as it appears in different input nodes each time.

Recurrent neural networks solve this problem. Instead of having a node for each input, we have a big “box” of nodes that we apply to the input again and again. The “box” learns a sort of transition function, which means that the outputs follow some recurrence relation, hence the name.

Remember that the vision people got a lot of the same effect for images using convolutions. That is, instead of having an input node for each pixel, convolutions allowed the reuse of the same, small set of parameters across the entire image.

Long Term Dependencies

The promise of RNNs is their ability to implicitly model long term dependencies. The picture below is taken from OpenAI. They trained a model that ended up recognizing sentiment and colored the text, character by character, with the model’s output. Notice how the model sees the word “best” and triggers a positive sentiment which it carries on for over 100 characters. That’s capturing a long range dependency.

The theory of RNNs promises us long range dependencies out of the box. The practice is a little more difficult. When we learn via backpropagation, we need to propagate the signal through the entire recurrence relation. The thing is, at every step we end up multiplying by a number. If those numbers are generally smaller than 1, our signal will quickly go to 0. If they are larger than 1, then our signal will explode.

These issues are called the vanishing and exploding gradient and are generally resolved by LSTMs and a few clever tricks. I mention them know because we’ll encounter these problems again with convolutions and will need another way to address them.

Advantages of convolutions

So far we’ve seen how great LSTMs are, but this post is about convolutions. In the spirit of don’t fix what ain’t broken, we have to ask ourselves why we’d want to use convolutions at all.

One answer is “ Because we can”.

But there are two other compelling reasons to use convolutions, speed, and context.

Parrelalisation

RNNs operate sequentially, the output for the second input depends on the first one and so we can’t parallelise an RNN. Convolutions have no such problem, each “patch” a convolutional kernel operates on is independent of the other meaning that we can go over the entire input layer concurrently.

There is a price to pay for this, as we’ll see we have to stack convolutions into deep layers in order to view the entire input and each of those layers is calculated sequentially. But the calculations at each layer happen concurrently and each individual computation is small (compared to an LSTM) such that in practice we get a big speed up.

When I set out to write this I only had my own experience and Google’s ByteNet to back this claim up. Just this week, Facebook published their fully convolutional translation model and reported a 9X speed up over LSTM based models.

View the whole input at once

LSTMs read their input from left to right (or right to left) but sometimes we’d like to have the context of the end of the sentence influence the networks thoughts about its begining. For example, we might have a sentence like “I’d love to buy your product. Not!” and we’d like that negation at the end to influence the entire sentence.

With LSTMs we achieve this by running two LSTMs, one left to right and the other right to left and concatenating their outputs. This works well in practice but doubles our computational load.

Convolutions, on the other hand, grow a larger “receptive field” as we stack more and more layers. That means that by default, each “step” in the convolution’s representation views all of the input in its receptive field, from before and after it. I’m not aware of any definitive argument that this is inherently better than an LSTM, but it does give us the desired effect in a controllable fashion and with a low computational cost.

So far we’ve set up our problem domain and talked a bit about the conceptual advantages of convolutions for NLP. From here out, I’d like to translate those concepts into practical methods that we can use to analyze and construct our networks.

Practical convolutions for text

You’ve probably seen an animation like the one above illustrating what a convolution does. The bottom is an input image, the top is the result and the gray shadow is the convolutional kernel which is repeatedly applied.

This all makes perfect sense except that the input described in the picture is an image, with two spatial dimensions (height and width). We’re talking about text, which has only one dimension, and it’s temporal not spatial.

For all practical purposes, that doesn’t matter. We just need to think of our text as an image of width n and height 1. Tensorflow provides a conv1d function that does that for us, but it does not expose other convolutional operations in their 1d version.

To make the “Text = an image of height 1” idea concrete, let’s see how we’d use the 2d convolutional op in Tensorflow on a sequence of embedded tokens.

So what we’re doing here is changing the shape of input with tf.expand_dims so that it becomes an “Image of height 1”. After running the 2d convolution operator we squeeze away the extra dimension.

Hierarchy and Receptive Fields

Many of us have seen pictures like the one above. It roughly shows the hierarchy of abstractions a CNN learns on images. In the first layer, the network learns basic edges. In the next layer, it combines those edges to learn more abstract concepts like eyes and noses. Finally, it combines those to recognize individual faces.

With that in mind, we need to remember that each layer doesn’t just learn more abstract combinations of the previous layer. Successive layers, implicitly or explicitly, see more of the input

Increasing Receptive Field

With vision often we’ll want the network to identify one or more objects in the picture while ignoring others. That is, we’ll be interested in some local phenomenon but not in a relationship that spans the entire input.

Text is more subtle as often we’ll want intermediate representations of our data to carry as much context about their surroundings as they possibly can. In other words, we want to have as large a receptive field as possible. Their are a few ways to go about this.

Larger Filters

The first, most obvious, way is to increase the filter size, that is doing a [1x5] convolution instead of a [1x3]. In my work with text, I’ve not had great results with this and I’ll offer my speculations as to why.

In my domain, I mostly deal with character level inputs and with texts that are morphologicaly very rich. I think of (at least the first) layers of convolution as learning n-grams so that the width of the filter corresponds to bigrams, trigrams etc. Having the network learn larger n-grams early exposes it to fewer examples, as there are more occurrences of “ab” in a text than “abb”.

I’ve never proved this interpretation but have gotten consistently poorer results with filter widths larger than 3.

Adding Layers

As we saw in the picture above, adding more layers will increase the receptive field. Dang Ha The Hien wrote a great guide to calculating the receptive field at each layer which I encourage you to read.

Adding layers has two distinct but related effects. The one that gets thrown around a lot is that the model will learn to make higher level abstractions over the inputs that it gets (Pixels =>Edges => Eyes => Face). The other is that the receptive field grows at each step .

This means that given enough depth, our network could look at the entire input layer though perhaps through a haze of abstractions. Unfortunately this is where the vanishing gradient problem may rear its ugly head.

The Gradient / Receptive field trade off

Neural networks are networks that information flows through. In the forward pass our input flows and transforms, hopefully becoming a representation that is more amenable to our task. During the back phase we propagate a signal, the gradient, back through the network. Just like in vanilla RNNs, that signal gets multiplied frequently and if it goes through a series of numbers that are smaller than 1 then it will fade to 0. That means that our network will end up with very little signal to learn from.

This leaves us with something of a tradeoff. On the one hand, we’d like to be able to take in as much context as possible. On the other hand, if we try to increase our receptive fields by stacking layers we risk vanishing gradients and a failure to learn anything.

Two Solutions to the Vanishing Gradient Problem

Luckily, many smart people have been thinking about these problems. Luckier still, these aren’t problems that are unique to text, the vision people also want larger receptive fields and information rich gradients. Let’s take a look at some of their crazy ideas and use them to further our own textual glory.

Residual Connections

2016 was another great year for the vision people with at least two very popular architectures emerging, ResNets and DenseNets (The DenseNet paper, in particular, is exceptionally well written and well worth the read) . Both of them address the same problem “How do I make my network very deep without losing the gradient signal?”

Arthur Juliani wrote a fantastic overview of Resnet, DenseNets and Highway networks for those of you looking for the details and comparison. I’ll briefly touch on DenseNets which take the core concept to its extreme.

The general idea is to reduce the distance between the signal coming from the networks loss and each individual layer. The way this is done is by adding a residual/direct connection between every layer and its predecessors. That way, the gradient can flow from each layer to its predecessors directly.

DenseNets do this in a particularly interesting way. They concatenate the output of each layer to its input such that:

We start with an embedding of our inputs, say of dimension 10.
Our first layer calculates 10 feature maps. It outputs the 10 feature maps concatenated to the original embedding.
The second layer gets as input 20 dimensional vectors (10 from the input and 10 from the previous layer) and calculates another 10 feature maps. Thus it outputs 30 dimensional vectors.

And so on and so on for as many layers as you’d like. The paper describes a boat load of tricks to make things manageable and efficient but that’s the basic premise and the vanishing gradient problem is solved.

There are two other things I’d like to point out.

I previously mentioned that upper layers have a view of the original input that may be hazed by layers of abstraction. One of the highlights of concatenating the outputs of each layer is that the original signal reaches the following layers intact, so that all layers have a direct view of lower level features, essentially removing some of the haze.
The Residual connection trick requires that all of our layers have the same shape. That means that we need to pad each layer so that its input and output have the same spatial dimensions [1Xwidth]. That means that, on its own, this kind of architecture will work for sequence labeling tasks (Where the input and the output have the same spatial dimensions) but will need more work for encoding and classification tasks (Where we need to reduce the input to a fixed size vector or set of vectors). The DenseNet paper actually handles this as their goal is to do classification and we’ll expand on this point later.

Dilated Convolutions

Dilated convolutions AKA atrous convolutions AKA convolutions with holes are another method of increasing the receptive field without angering the gradient gods. When we looked at stacking layers so far, we saw that the receptive field grows linearly with depth. Dilated convolutions let us grow the receptive field exponentially with depth.

You can find an almost accessible explanation of dilated convolutions in the paper Multi scale context aggregation by dilated convolutions which uses them for vision. While conceptually simple, it took me a while to understand exactly what they do, and I may still have it not quite right.

The basic idea is to introduce “holes” into each filter, so that it doesn’t operate on adjacent parts of the input but rather skips over them to parts further away. Note that this is different from applying a convolution with stride >1. When we stride a filter, we skip over parts of the input between applications of the convolution. With dilated convolutions, we skip over parts of the input within a single application of the convolution. By cleverly arranging growing dilations we can achieve the promised exponential growth in receptive fields.

We’ve talked a lot of theory so far, but we’re finally at a point where we can see this stuff in action!

A personal favorite paper is Neural Machine Translation in Linear Time. It follows the encoder decoder structure we talked about in the beginning. We still don’t have all the tools to talk about the decoder, but we can see the encoder in action.

And here’s an English input

Director Jon Favreau, who is currently working on Disney’s forthcoming Jungle Book film, told the website Hollywood Reporter: “I think times are changing.”

And its translation, brought to you by dilated convolutions

Regisseur Jon Favreau, der zur Zeit an Disneys kommendem Jungle Book Film arbeitet, hat der Website Hollywood Reporter gesagt: “Ich denke, die Zeiten andern sich”.

And as a bonus, remember that sound is just like text, in the sense that it has just one spatial/temporal dimension. Check out DeepMind’s Wavenet which uses dilated convolutions (and a lot of other magic) to generate human sounding speech and piano music.

Getting Stuff Out of your network

When we discussed DenseNets I mentioned that the use of residual connections forces us to keep the input and output length of our sequence the same, which is done via padding. This is great for tasks where we need to label each item in our sequence for example:

In parts of speech tagging where each word is a part of speech.
In entity recognition where we might label Person, Company, and Other for everything else

Other times we’ll want to reduce our input sequence down to a vector representation and use that to predict something about the entire sentence

We might want to label an email as spam based on its content and or subject
Predict if a certain sentence is sarcastic or not

In these cases, we can follow the traditional approaches of the vision people and top off our network with convolutional layers that don’t have padding and/or use pooling operations.

But sometimes we’ll want to follow the Seq2Seq paradigm, what Matthew Honnibal succinctly called Embed, encode, attend, predict. In this case, we reduce our input down to some vector representation but then need to somehow up sample that vector back to a sequence of the proper length.

This task entails two problems

How do we do upsampling with convolutions ?
How do we do exactly the right amount of up sampling?

I still haven’t found the answer to the second question or at least have not yet understood it. In practice, it’s been enough for me to assume some upper bound on the maximum length of the output and then upsample to that point. I suspect Facebooks new translation paper may address this but have not yet read it deeply enough to comment.

Upsampling with deconvolutions

Deconvolutions are our tool for upsampling. It’s easiest (for me) to understand what they do through visualizations. Luckily, a few smart folks published a great post on deconvolutions over at Distill and included some fun visualizers. Lets start with those.

Consider the image on top. If we take the bottom layer as the input we have a standard convolution of stride 1 and width 3. But, we can also go from top down, that is treat the top layer as the input and get the slightly larger bottom layer.

If you stop to think about that for a second, this “top down” operation is already happening in your convolutional networks when you do back propagation, as the gradient signals need to propagate in exactly the way shown in the picture. Even better, it turns out that this operation is simply the transpose of the convolution operation, hence the other common (and technically correct) name for this operation, transposed convolution.

Here’s where it gets fun. We can stride our convolutions to shrink our input. Thus we can stride our deconvolutions to grow our input. I think the easiest way to understand how strides work with deconvolutions is to look at the following pictures.

We’ve already seen the top one. Notice that each input (the top layer) feeds three of the outputs and that each of the outputs is fed by three inputs (except the edges).

In the second picture we place imaginary holes in our inputs. Notice that now each of the outputs is fed by at most two inputs.

In the third picture we’ve added two imaginary holes into out input layer and so each output is fed by exactly one input. This ends up tripling the sequence length of our output with respect to the sequence length of our input.

Finally, we can stack multiple deconvolutional layers to gradually grow our output layer to the desired size.

A few things worth thinking about

If you look at these drawings from bottom up, they end up being standard strided convolutions where we just added imaginary holes at the output layers (The white blocks)
In practice, each “input” isn’t a single number but a vector. In the image world, it might be a 3 dimensional RGB value. In text it might be a 300 dimensional word embedding. If you’re (de)convolving in the middle of your network each point would be a vector of whatever size came out of the last layer.
I point that out to convince you that their is enough information in the input layer of a deconvolution to spread across a few points in the output.
In practice, I’ve had success running a few convolutions with length preserving padding after a deconvolution. I imagine, though haven’t proven, that this acts like a redistribution of information. I think of it like letting a steak rest after grilling to let the juices redistribute.

Summary

The main reason you might want to consider convolutions in your work is because they are fast. I think that’s important to make research and exploration faster and more efficient. Faster networks shorten our feedback cycles.

Most of the tasks I’ve encountered with text end up having the same requirement of the architecture: Maximize the receptive field while maintaining an adequate flow of gradients. We’ve seen the use of both DenseNets and dilated convolutions to achieve that.

Finally, sometimes we want to expand a sequence or a vector into a larger sequence. We looked at deconvolutions as a way to do “upsampling” on text and as a bonus compared adding a convolution afterwards the letting a steak rest and redistribute its juices.

I’d love to learn more about your thoughts and experiences with these kinds of models. Share in the comments or ping me on twitter @thetalperry

Deep Learning The Stock Market

Sat, 03 Dec 2016 00:00:00 +0000

Update 15.03.2024 I wrote this more than seven years ago. My understanding has evolved since then, and the world of deep learning has gone through more than one revolution since. It was popular back in the day and might still be a fun read though you might learn more accurate and upto date information somewhere else

Update 25.1.17 — Took me a while but here is an ipython notebook with a rough implementation

Why NLP is relevant to Stock prediction

In many NLP problems we end up taking a sequence and encoding it into a single fixed size representation, then decoding that representation into another sequence. For example, we might tag entities in the text, translate from English to French or convert audio frequencies to text. There is a torrent of work coming out in these areas and a lot of the results are achieving state of the art performance.

In my mind the biggest difference between the NLP and financial analysis is that language has some guarantee of structure, it’s just that the rules of the structure are vague. Markets, on the other hand, don’t come with a promise of a learnable structure, that such a structure exists is the assumption that this project would prove or disprove (rather it might prove or disprove if I can find that structure).

Assuming the structure is there, the idea of summarizing the current state of the market in the same way we encode the semantics of a paragraph seems plausible to me. If that doesn’t make sense yet, keep reading. It will.

You shall know a word by the company it keeps (Firth, J. R. 1957:11)

There is tons of literature on word embeddings. Richard Socher’s lecture is a great place to start. In short, we can make a geometry of all the words in our language, and that geometry captures the meaning of words and relationships between them. You may have seen the example of “King-man +woman=Queen” or something of the sort.

Embeddings are cool because they let us represent information in a condensed way. The old way of representing words was holding a vector (a big list of numbers) that was as long as the number of words we know, and setting a 1 in a particular place if that was the current word we are looking at. That is not an efficient approach, nor does it capture any meaning. With embeddings, we can represent all of the words in a fixed number of dimensions (300 seems to be plenty, 50 works great) and then leverage their higher dimensional geometry to understand them.

The picture below shows an example. An embedding was trained on more or less the entire internet. After a few days of intensive calculations, each word was embedded in some high dimensional space. This “space” has a geometry, concepts like distance, and so we can ask which words are close together. The authors/inventors of that method made an example. Here are the words that are closest to Frog.

But we can embed more than just words. We can do, say , stock market embeddings.

Market2Vec

The first word embedding algorithm I heard about was word2vec. I want to get the same effect for the market, though I’ll be using a different algorithm. My input data is a csv, the first column is the date, and there are 4*1000 columns corresponding to the High Low Open Closing price of 1000 stocks. That is my input vector is 4000 dimensional, which is too big. So the first thing I’m going to do is stuff it into a lower dimensional space, say 300 because I liked the movie.

Taking something in 4000 dimensions and stuffing it into a 300-dimensional space my sound hard but its actually easy. We just need to multiply matrices. A matrix is a big excel spreadsheet that has numbers in every cell and no formatting problems. Imagine an excel table with 4000 columns and 300 rows, and when we basically bang it against the vector a new vector comes out that is only of size 300. I wish that’s how they would have explained it in college.

The fanciness starts here as we’re going to set the numbers in our matrix at random, and part of the “deep learning” is to update those numbers so that our excel spreadsheet changes. Eventually this matrix spreadsheet (I’ll stick with matrix from now on) will have numbers in it that bang our original 4000 dimensional vector into a concise 300 dimensional summary of itself.

We’re going to get a little fancier here and apply what they call an activation function. We’re going to take a function, and apply it to each number in the vector individually so that they all end up between 0 and 1 (or 0 and infinity, it depends). Why ? It makes our vector more special, and makes our learning process able to understand more complicated things. How?

So what? What I’m expecting to find is that the new embedding of the market prices (the vector) into a smaller space captures all the essential information for the task at hand, without wasting time on the other stuff. So I’d expect they’d capture correlations between other stocks, perhaps notice when a certain sector is declining or when the market is very hot. I don’t know what traits it will find, but I assume they’ll be useful.

Now What

Lets put aside our market vectors for a moment and talk about language models. Andrej Karpathy wrote the epic post “The Unreasonable effectiveness of Recurrent Neural Networks”. If I’d summarize in the most liberal fashion the post boils down to

If we look at the works of Shakespeare and go over them character by character, we can use “deep learning” to learn a language model.
A language model (in this case) is a magic box. You put in the first few characters and it tells you what the next one will be.
If we take the character that the language model predicted and feed it back in we can keep going forever.

And then as a punchline, he generated a bunch of text that looks like Shakespeare. And then he did it again with the Linux source code. And then again with a textbook on Algebraic geometry.

So I’ll get back to the mechanics of that magic box in a second, but let me remind you that we want to predict the future market based on the past just like he predicted the next word based on the previous one. Where Karpathy used characters, we’re going to use our market vectors and feed them into the magic black box. We haven’t decided what we want it to predict yet, but that is okay, we won’t be feeding its output back into it either.

Going deeper

I want to point out that this is where we start to get into the deep part of deep learning. So far we just have a single layer of learning, that excel spreadsheet that condenses the market. Now we’re going to add a few more layers and stack them, to make a “deep” something. That’s the deep in deep learning.

So Karpathy shows us some sample output from the Linux source code, this is stuff his black box wrote.

static void action_new_function(struct s_stat_info *wb)
{
 unsigned long flags;
 int lel_idx_bit = e->edd, *sys & ~((unsigned long) *FIRST_COMPAT);
 buf[0] = 0xFFFFFFFF & (bit << 4);
 min(inc, slist->bytes);
 printk(KERN_WARNING "Memory allocated %02x/%02x, "
 "original MLL instead\n"),
 min(min(multi_run - s->len, max) * num_data_in),
 frame_pos, sz + first_seg);
 div_u64_w(val, inb_p);
 spin_unlock(&disk->queue_lock);
 mutex_unlock(&s->sock->mutex);
 mutex_unlock(&func->mutex);
 return disassemble(info->pending_bh);
}

Notice that it knows how to open and close parentheses, and respects indentation conventions; The contents of the function are properly indented and the multi-line printk statement has an inner indentation. That means that this magic box understands long range dependencies. When it’s indenting within the print statement it knows it’s in a print statement and also remembers that it’s in a function( Or at least another indented scope). That’s nuts. It’s easy to gloss over that but an algorithm that has the ability to capture and remember long term dependencies is super useful because… We want to find long term dependencies in the market.

Inside the magical black box

What’s inside this magical black box? It is a type of Recurrent Neural Network (RNN) called an LSTM. An RNN is a deep learning algorithm that operates on sequences (like sequences of characters). At every step, it takes a representation of the next character (Like the embeddings we talked about before) and operates on the representation with a matrix, like we saw before. The thing is, the RNN has some form of internal memory, so it remembers what it saw previously. It uses that memory to decide how exactly it should operate on the next input. Using that memory, the RNN can “remember” that it is inside of an intended scope and that is how we get properly nested output text.

A fancy version of an RNN is called a Long Short Term Memory (LSTM). LSTM has cleverly designed memory that allows it to

Selectively choose what it remembers
Decide to forget
Select how much of it’s memory it should output.

So an LSTM can see a “{“ and say to itself “Oh yeah, that’s important I should remember that” and when it does, it essentially remembers an indication that it is in a nested scope. Once it sees the corresponding “}” it can decide to forget the original opening brace and thus forget that it is in a nested scope.

We can have the LSTM learn more abstract concepts by stacking a few of them on top of each other, that would make us “Deep” again. Now each output of the previous LSTM becomes the inputs of the next LSTM, and each one goes on to learn higher abstractions of the data coming in. In the example above (and this is just illustrative speculation), the first layer of LSTMs might learn that characters separated by a space are “words”. The next layer might learn word types like (**static** **void** **action_new_function).**The next layer might learn the concept of a function and its arguments and so on. It’s hard to tell exactly what each layer is doing, though Karpathy’s blog has a really nice example of how he did visualize exactly that.

Connecting Market2Vec and LSTMs

The studious reader will notice that Karpathy used characters as his inputs, not embeddings (Technically a one-hot encoding of characters). But, Lars Eidnes actually used word embeddings when he wrote Auto-Generating Clickbait With Recurrent Neural Network

The figure above shows the network he used. Ignore the SoftMax part (we’ll get to it later). For the moment, check out how on the bottom he puts in a sequence of words vectors at the bottom and each one. (Remember, a “word vector” is a representation of a word in the form of a bunch of numbers, like we saw in the beginning of this post). Lars inputs a sequence of Word Vectors and each one of them:

Influences the first LSTM
Makes it’s LSTM output something to the LSTM above it
Makes it’s LSTM output something to the LSTM for the next word

We’re going to do the same thing with one difference, instead of word vectors we’ll input “MarketVectors”, those market vectors we described before. To recap, the MarketVectors should contain a summary of what’s happening in the market at a given point in time. By putting a sequence of them through LSTMs I hope to capture the long term dynamics that have been happening in the market. By stacking together a few layers of LSTMs I hope to capture higher level abstractions of the market’s behavior.

What Comes out

Thus far we haven’t talked at all about how the algorithm actually learns anything, we just talked about all the clever transformations we’ll do on the data. We’ll defer that conversation to a few paragraphs down, but please keep this part in mind as it is the se up for the punch line that makes everything else worthwhile.

In Karpathy’s example, the output of the LSTMs is a vector that represents the next character in some abstract representation. In Eidnes’ example, the output of the LSTMs is a vector that represents what the next word will be in some abstract space. The next step in both cases is to change that abstract representation into a probability vector, that is a list that says how likely each character or word respectively is likely to appear next. That’s the job of the SoftMax function. Once we have a list of likelihoods we select the character or word that is the most likely to appear next.

In our case of “predicting the market”, we need to ask ourselves what exactly we want to market to predict? Some of the options that I thought about were:

Predict the next price for each of the 1000 stocks
Predict the value of some index (S&P, VIX etc) in the next n minutes.
Predict which of the stocks will move up by more than x% in the next n minutes
(My personal favorite) Predict which stocks will go up/down by 2x% in the next n minutes while not going down/up by more than x% in that time.
(The one we’ll follow for the remainder of this article). Predict when the VIX will go up/down by 2x% in the next n minutes while not going down/up by more than x% in that time.

1 and 2 are regression problems, where we have to predict an actual number instead of the likelihood of a specific event (like the letter n appearing or the market going up). Those are fine but not what I want to do.

3 and 4 are fairly similar, they both ask to predict an event (In technical jargon — a class label). An event could be the letter n appearing next or it could be Moved up 5% while not going down more than 3% in the last 10 minutes. The trade-off between 3 and 4 is that 3 is much more common and thus easier to learn about while 4 is more valuable as not only is it an indicator of profit but also has some constraint on risk.

5 is the one we’ll continue with for this article because it’s similar to 3 and 4 but has mechanics that are easier to follow. The VIX is sometimes called the Fear Index and it represents how volatile the stocks in the S&P500 are. It is derived by observing the implied volatility for specific options on each of the stocks in the index.

Sidenote — Why predict the VIX

What makes the VIX an interesting target is that

It is only one number as opposed to 1000s of stocks. This makes it conceptually easier to follow and reduces computational costs.
It is the summary of many stocks so most if not all of our inputs are relevant
It is not a linear combination of our inputs. Implied volatility is extracted from a complicated, non-linear formula stock by stock. The VIX is derived from a complex formula on top of that. If we can predict that, it’s pretty cool.
It’s tradeable so if this actually works we can use it.

Back to our LSTM outputs and the SoftMax

How do we use the formulations we saw before to predict changes in the VIX a few minutes in the future? For each point in our dataset, we’ll look what happened to the VIX 5 minutes later. If it went up by more than 1% without going down more than 0.5% during that time we’ll output a 1, otherwise a 0. Then we’ll get a sequence that looks like:

0,0,0,0,0,1,1,0,0,0,1,1,0,0,0,0,1,1,1,0,0,0,0,0 ….

We want to take the vector that our LSTMs output and squish it so that it gives us the probability of the next item in our sequence being a 1. The squishing happens in the SoftMax part of the diagram above. (Technically, since we only have 1 class now, we use a sigmoid ).

So before we get into how this thing learns, let’s recap what we’ve done so far

We take as input a sequence of price data for 1000 stocks
Each timepoint in the sequence is a snapshot of the market. Our input is a list of 4000 numbers. We use an embedding layer to represent the key information in just 300 numbers.
Now we have a sequence of embeddings of the market. We put those into a stack of LSTMs, timestep by timestep. The LSTMs remember things from the previous steps and that influences how they process the current one.
We pass the output of the first layer of LSTMs into another layer. These guys also remember and they learn higher level abstractions of the information we put in.
Finally, we take the output from all of the LSTMs and “squish them” so that our sequence of market information turns into a sequence of probabilities. The probability in question is “How likely is the VIX to go up 1% in the next 5 minutes without going down 0.5%”

How does this thing learn?

Now the fun part. Everything we did until now was called the forward pass, we’d do all of those steps while we train the algorithm and also when we use it in production. Here we’ll talk about the backward pass, the part we do only while in training that makes our algorithm learn.

So during training, not only did we prepare years worth of historical data, we also prepared a sequence of prediction targets, that list of 0 and 1 that showed if the VIX moved the way we want it to or not after each observation in our data.

To learn, we’ll feed the market data to our network and compare its output to what we calculated. Comparing in our case will be simple subtraction, that is we’ll say that our model’s error is

ERROR = (((precomputed)— (predicted probability))² )^(1/2)

Or in English, the square root of the square of the difference between what actually happened and what we predicted.

Here’s the beauty. That’s a differential function, that is, we can tell by how much the error would have changed if our prediction would have changed a little. Our prediction is the outcome of a differentiable function, the SoftMax The inputs to the softmax, the LSTMs are all mathematical functions that are differentiable. Now all of these functions are full of parameters, those big excel spreadsheets I talked about ages ago. So at this stage what we do is take the derivative of the error with respect to every one of the millions of parameters in all of those excel spreadsheets we have in our model. When we do that we can see how the error will change when we change each parameter, so we’ll change each parameter in a way that will reduce the error.

This procedure propagates all the way to the beginning of the model. It tweaks the way we embed the inputs into MarketVectors so that our MarketVectors represent the most significant information for our task.

It tweaks when and what each LSTM chooses to remember so that their outputs are the most relevant to our task.

It tweaks the abstractions our LSTMs learn so that they learn the most important abstractions for our task.

Which in my opinion is amazing because we have all of this complexity and abstraction that we never had to specify anywhere. It’s all inferred MathaMagically from the specification of what we consider to be an error.

What’s next

Now that I’ve laid this out in writing and it still makes sense to me I want

To see if anyone bothers reading this.
To fix all of the mistakes my dear readers point out
Consider if this is still feasible
And build it

So, if you’ve come this far please point out my errors and share your inputs.

Other thoughts

Here are some mostly more advanced thoughts about this project, what other things I might try and why it makes sense to me that this may actually work.

Liquidity and efficient use of capital

Generally the more liquid a particular market is the more efficient that is. I think this is due to a chicken and egg cycle, whereas a market becomes more liquid it is able to absorb more capital moving in and out without that capital hurting itself. As a market becomes more liquid and more capital can be used in it, you’ll find more sophisticated players moving in. This is because it is expensive to be sophisticated, so you need to make returns on a large chunk of capital in order to justify your operational costs.

A quick corollary is that in less liquid markets the competition isn’t quite as sophisticated and so the opportunities a system like this can bring may not have been traded away. The point being were I to try and trade this I would try and trade it on less liquid segments of the market, that is maybe the TASE 100 instead of the S&P 500.

This stuff is new

The knowledge of these algorithms, the frameworks to execute them and the computing power to train them are all new at least in the sense that they are available to the average Joe such as myself. I’d assume that top players have figured this stuff out years ago and have had the capacity to execute for as long but, as I mention in the above paragraph, they are likely executing in liquid markets that can support their size. The next tier of market participants, I assume, have a slower velocity of technological assimilation and in that sense, there is or soon will be a race to execute on this in as yet untapped markets.

Multiple Time Frames

While I mentioned a single stream of inputs in the above, I imagine that a more efficient way to train would be to train market vectors (at least) on multiple time frames and feed them in at the inference stage. That is, my lowest time frame would be sampled every 30 seconds and I’d expect the network to learn dependencies that stretch hours at most.

I don’t know if they are relevant or not but I think there are patterns on multiple time frames and if the cost of computation can be brought low enough then it is worthwhile to incorporate them into the model. I’m still wrestling with how best to represent these on the computational graph and perhaps it is not mandatory to start with.

MarketVectors

When using word vectors in NLP we usually start with a pretrained model and continue adjusting the embeddings during training of our model. In my case, there are no pretrained market vector available nor is tehre a clear algorithm for training them.

My original consideration was to use an auto-encoder like in this paper but end to end training is cooler.

A more serious consideration is the success of sequence to sequence models in translation and speech recognition, where a sequence is eventually encoded as a single vector and then decoded into a different representation (Like from speech to text or from English to French). In that view, the entire architecture I described is essentially the encoder and I haven’t really laid out a decoder.

But, I want to achieve something specific with the first layer, the one that takes as input the 4000 dimensional vector and outputs a 300 dimensional one. I want it to find correlations or relations between various stocks and compose features about them.

The alternative is to run each input through an LSTM, perhaps concatenate all of the output vectors and consider that output of the encoder stage. I think this will be inefficient as the interactions and correlations between instruments and their features will be lost, and thre will be 10x more computation required. On the other hand, such an architecture could naively be paralleled across multiple GPUs and hosts which is an advantage.

CNNs

Recently there has been a spur of papers on character level machine translation. This paper caught my eye as they manage to capture long range dependencies with a convolutional layer rather than an RNN. I haven’t given it more than a brief read but I think that a modification where I’d treat each stock as a channel and convolve over channels first (like in RGB images) would be another way to capture the market dynamics, in the same way that they essentially encode semantic meaning from characters.