Evan O'Donnell

Machines Learned to Read. Now, Can They Stand to Reason?


This issue explores how intelligent, application-layer architecture can create value beyond foundational AI models.


Special thanks to Jared White, CEO and co-founder of Matey.ai, a Timespan portfolio company, for providing feedback on this article. Matey is pushing the boundaries of what’s possible with unstructured data through its intelligent software.


 

01 | How machines learned to read

At Timespan Ventures, we think about technology on a timeline, through a historical lens.


This helps (i) filter out incremental products, those recycling old primitives; (ii) anticipate new software paradigms; and (iii) visualize how those breakthroughs need to be productized to unlock new categories of value.


One shift we are investing behind is the ability for AI to process unstructured, multimodal data.[1]


Early machine learning systems like recurrent neural networks (RNNs) processed data sequentially, analyzing each component in a fixed order. This limited their ability to retain information and understand complex relationships across large data sets.


To work within these constraints, data had to be pre-formatted and fed to these systems in a specific order.


However, in the 2010s, two major breakthroughs in parallel processing [2] enabled computers to move beyond these constraints:


  1. Unlocking scale: GPUs, originally designed for graphics rendering in gaming, were adapted to process large datasets using thousands of parallel cores.[3] This new chip design allowed developers to dramatically scale the amount of data and compute used in model training, which resulted in larger and more sophisticated neural networks. The chart below shows the exponential growth in computation used to train AI over time. Before GPUs, the total resources dedicated to training top-performing models grew steadily at 1.5x per year; with modern GPUs, this growth accelerated to 5x every year.



  2. Algorithmic breakthroughs: In 2017, Google researchers introduced a “self-attention” mechanism, the underpinning of the transformer model.[4] This method enabled AI to analyze relationships in data simultaneously (rather than sequentially), capturing context and long-range dependencies. It also eliminated the need to feed data in a clear, structured order, as models could dynamically assign importance to each element based on its relevance to others, regardless of their sequence. (Thanks to ongoing algorithmic modifications, AI models are currently advancing at more than twice the pace of Moore’s Law!)


Together, these advancements have transformed how machines process data, eliminating the need for rigid pre-formatting or upfront modification.[5]


In other words, computers can now handle raw, unstructured information – words, images, video – at a higher level of abstraction.


This is a familiar pattern in software. For example, transpilers abstracted manual code translation between languages,[6] and containers abstracted environment configuration, enabling applications to run seamlessly across different platforms.[7]


Now, this shift is happening with data.


This represents a significant leap forward. It’s why generative AI felt so impressive – and human – when it first hit the consumer market.


And understanding this evolution also sheds light on AI’s shortcomings – where these models still underperform, why some question the $200B annual spend on model development – and the role startups can play to harness the raw power of these models for applications in the real economy.


 

02 | Application layer ~ Reasoning layer

As model capabilities advance, what role can application-layer architecture play? Can startup applications build a competitive edge against incumbents that integrate directly with these models and already have advantages in data and distribution?


These are timely questions.


Today’s models excel at general knowledge. Trained on massive, internet-scale datasets, they demonstrate impressive recall and inductive reasoning. Ask an LLM about any topic – quantum physics, 17th-century Tulip Mania, or Act IV of Hamlet – and it responds thoroughly, in seconds.


However, these foundational models are still challenging to work with. 


Unaided, they struggle with planning, deductive reasoning, and abstraction. For instance, they falter when prompted with superfluous information, multi-step problems, or obscured subjects – tasks that require “thinking” beyond pattern-based predictions.


This inability to reason is especially problematic when applying AI to more complex fields like supply chain management, medicine, and law, domains where context and judgment are essential.


And research indicates that simply scaling models with more data and compute is unlikely to close this gap.


While the latest mega models appear to handle complex tasks better, they still rely on pattern recognition rather than true, principled reasoning. For example, OpenAI’s o1 model – a GPT-4 variant optimized for step-by-step problem solving – still struggles to respond well to complex or unsolvable inputs, demonstrating a reliance on prediction patterns and approximate retrieval rather than true understanding. Even future models, such as OpenAI's upcoming Orion, are not expected to show substantial gains in this capability, especially when factoring in their massive development costs.


But this “reasoning gap” is one that startups can exploit, particularly those building domain-specific applications.


Until recently, many AI applications were thin “wrappers” around models, offering limited differentiation. Today, however, post-training techniques – often referred to as cognitive architecture – can create a deeper intelligence layer within applications by integrating code, prompts, and model calls to transform user input into more precise actions and responses.
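To make this concrete, here is a minimal sketch of what such an intelligence layer might look like in practice. The helpers `llm_call`, `classify_intent`, and `retrieve_context` are hypothetical placeholders, not any particular vendor's API; the point is the shape of the pipeline, where code, prompts, and multiple model calls sit between the user and the foundation model:

```python
# Minimal sketch of an application-side "cognitive architecture":
# route the request, ground it in domain data, call a model, validate the output.
# `llm_call`, `classify_intent`, and `retrieve_context` are hypothetical helpers.

from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    sources: list[str]

def llm_call(prompt: str) -> str:
    """Stand-in for a foundation-model call; swap in a real API client here."""
    return f"(model reply to: {prompt[:40]}...)"

def classify_intent(user_input: str) -> str:
    # A cheap model call (or simple rules) decides which workflow to run.
    return llm_call(f"Classify this request as 'lookup' or 'analysis':\n{user_input}")

def retrieve_context(user_input: str) -> list[str]:
    # Placeholder for retrieval over proprietary, domain-specific data.
    return ["<relevant document snippets>"]

def answer(user_input: str) -> Answer:
    intent = classify_intent(user_input)
    context = retrieve_context(user_input)
    draft = llm_call(
        f"Task type: {intent}\n"
        f"Context: {'; '.join(context)}\n"
        f"Question: {user_input}\n"
        "Answer using only the context above."
    )
    # A second pass checks the draft against the retrieved context before returning it.
    verdict = llm_call(f"Does this answer follow from the context? Answer Yes or No.\n{draft}")
    if verdict.strip().lower().startswith("no"):
        draft = llm_call(f"Revise the answer so it is supported by the context:\n{draft}")
    return Answer(text=draft, sources=context)

print(answer("Which clauses in this contract shift liability to the supplier?"))
```

The value here isn't in any single call; it's in the domain-specific prompts, retrieval corpus, and validation rules encoded around the model.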


In fact, these enhancements can deliver gains equivalent to a 5–30x increase in training compute – at a fraction of the cost.


Equivalent Training Compute Required to Match Post-Training Enhancement Gains

This chart shows the improvement from several post-training enhancement techniques, measured by Compute-Equivalent Gain (CEG) – the increase in pre-training compute needed to match the enhancement's performance boost. Source: https://arxiv.org/pdf/2312.07413
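Expressed as a ratio (my paraphrase of the paper's idea, not its exact notation):

$$\mathrm{CEG} = \frac{C_{\text{equivalent}}}{C_{\text{base}}}$$

where $C_{\text{base}}$ is the training compute actually used for the base model and $C_{\text{equivalent}}$ is the training compute a scaled-up version of that model would need to match the enhanced model's benchmark score. A CEG of 10 means the post-training technique buys roughly the same improvement as a 10x scale-up in training compute.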


These techniques can also drive differentiation and establish a near-term moat.


Since post-training enhancements are most effective when tailored to specific industries, model-layer players in the race to build general capabilities are unlikely to compete directly. Non-AI-native incumbents would need to cannibalize their existing product infrastructure to keep up.


If designed well, products using these techniques can build a unique foundation, providing an edge over other startup competitors applying AI in similar markets but lacking the same technical depth and sophistication.


 

03 | A framework for intelligent applications

Below is a framework we’ve developed to decompose the modern application layer. I’ve found it useful to reference when evaluating how new, AI-native applications can apply these foundational models to specific industries and build an early, technical edge.



Collectively, these components enable AI-native applications to overcome the reasoning and planning limitations of large, general models – and deliver deeper personalization, insight, and efficiency.


Collaborative, adaptive UX drives stickiness and repeat use. Advanced training methods, proprietary training data, and domain-specific knowledge graphs produce more nuanced outputs from unstructured data than what standalone models can achieve. Modular design and scalable pipelines enhance cost-efficiency and scalability.


Below are some of the aspects to focus on, though some may be more relevant than others, depending on the company, product vision, and market.



These technical elements alone aren’t enough to establish long-term defensibility. However, they can unlock a material performance edge that – when paired with a thoughtful product roadmap, distribution, and GTM strategy – can catalyze a growth flywheel, data accumulation, and distribution advantages that ladder up to a more durable moat over time.


 

Now is a special moment in the evolution of AI.


We’re seeing the limits of investing in model scaling and entering a moment when durable, specialized AI-native applications can emerge. With a clearer understanding of the potential – and limits – of foundational models, we can now better pinpoint where application-layer software can bridge gaps and deliver machine intelligence to real-world use cases.


At Timespan, our ambition is to be a thoughtful, committed partner to the protagonists of this story – the founders in the earliest stages of building a modern stack, solving industry problems with unstructured data in creative ways, and moving boldly and expeditiously to wow customers and navigate to product-market fit.



 

[1] Unstructured data refers to information that doesn’t have a predefined format or organization, making it harder to analyze and interpret compared to structured data, which is organized into a format like rows and columns that computers can more easily read. Unstructured data includes things like text documents, emails, social media posts, images, audio, video files, and sensor data, which vary widely in format and content.


[2] Parallel processing splits a task into independent units that run simultaneously on multiple processing cores, allowing faster task completion. For example, in image analysis, each core might analyze a different part of an image at the same time, then combine the results for a full analysis. In contrast, sequential processing handles each part of the image one after another, using a single core to process each section in order. This sequential approach is slower, as it waits for each step to finish before moving to the next.
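As a toy illustration of the image example above (a sketch only; the tiles and the per-tile `analyze_tile` function are made up for the example):

```python
# Toy comparison of sequential vs. parallel processing of image tiles.
# `analyze_tile` stands in for any per-region computation.
from concurrent.futures import ProcessPoolExecutor

def analyze_tile(tile: bytes) -> int:
    # Placeholder: e.g., count edge pixels, run a filter, etc.
    return len(tile)

def analyze_sequential(tiles: list[bytes]) -> list[int]:
    # One core works through the tiles in order, one after another.
    return [analyze_tile(t) for t in tiles]

def analyze_parallel(tiles: list[bytes]) -> list[int]:
    # Each worker process handles a different tile at the same time,
    # and the results are combined at the end.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(analyze_tile, tiles))

if __name__ == "__main__":
    tiles = [b"pixel-data" * 1000 for _ in range(8)]
    assert analyze_sequential(tiles) == analyze_parallel(tiles)
```

Both functions produce the same result; the parallel version simply finishes sooner when there are enough cores to spread the tiles across.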


[3] A core is a basic processing unit within a computer's CPU or GPU, capable of executing instructions or performing calculations. Each core can handle its own task independently. In CPUs, cores are typically powerful but limited in number, making them well-suited for sequential tasks. GPUs, however, have thousands of smaller, more specialized cores that excel in parallel processing, allowing them to handle large volumes of data concurrently, which is ideal for tasks like graphics rendering or AI computations. NVIDIA’s early growth as a gaming chip company – before becoming a leader in AI – is a great reminder that technology does not track to pre-existing sectors or categories!


[4] With self-attention, each input to the model (like a word) is tokenized and represented by three components: a query (what it’s looking for in other words), a key (what it offers to others), and a value (its actual content). The model uses these queries and keys to calculate attention scores that quantify the relationships between tokens. By focusing on these connections, the model can understand context across the whole input, capturing all of these relationships in parallel. This mechanism not only boosts the predictive power of the model, but also its reasoning capability. By capturing long-range relationships and focusing on the most relevant data points, self-attention allows the model to draw from the full context, enabling it to find complex patterns and produce structured, context-aware outputs and reason through tasks (like answering questions or generating summaries) with greater accuracy and logical consistency. This mechanism is called “self-attention” because the model is only attending to parts of the input itself, rather than external data, allowing each word to "attend" to relevant words around it.
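A compact numerical sketch of the mechanism described above, using NumPy with randomly initialized projection matrices (real models learn these weights and use multiple heads, positional information, and other details omitted here):

```python
# Minimal single-head self-attention over a toy sequence of token embeddings.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q                               # what each token is looking for
    K = X @ W_k                               # what each token offers to others
    V = X @ W_v                               # each token's actual content
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise relevance between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V     # each output mixes all tokens, weighted by relevance

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))          # toy token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)    # (5, 8): one context-aware vector per token
```

Note that every row of the attention weights is computed at once, which is what makes the mechanism parallel-friendly, in contrast to the step-by-step processing of an RNN.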


[5] This chart estimates the contributions of scaling and algorithmic innovation in terms of the raw compute that would be naively needed to achieve a state-of-the-art level of performance. The contribution of algorithmic progress is roughly half as much as that of compute scaling. Source: https://epochai.org/blog/algorithmic-progress-in-language-models.


[6] A transpiler, also known as a source-to-source compiler, is a tool that converts code written in one programming language into equivalent code in another language, typically at a similar abstraction level. Unlike a traditional compiler, which often translates high-level language code to lower-level machine code or bytecode, a transpiler converts code from one high-level language to another. This process enables developers to write code in one language but leverage the features and compatibility of another.


[7] A container is a self-contained unit of software that includes everything an application needs to run – its code, libraries, and system tools – so it works the same way in different environments. Unlike full virtual machines, containers share the host operating system, which makes them lightweight and efficient. This isolation means that each container can run independently on the same system, allowing developers to easily deploy and scale applications consistently across servers.
