By Jacques Bughin

1. LLM is dead, long live LLMs

Large Language Models (LLMs) have been the talk of the town for the past two years, demonstrating remarkable capabilities from natural language processing to creative tasks. But while their potential promises large productivity gains (see Table 1), the recent hype is now being met by significant voices of skepticism, including claims that LLM performance improvements are plateauing.

For many CEOs, these technology shortcomings, if real, add stress. They compound other issues, such as LLMs' so-called hallucinations and the organisational challenge of developing new generative AI capabilities on top of the make-or-buy decision for LLMs. So what should they do? Slow down their investments, at the mercy of competitors doubling down on GenAI, or push hard and commit to the technology despite its current shortcomings?

If history is any guide, there seems to be no turning back from LLMs: academic studies reveal that users very much like the technology and are prepared to delegate many cognitive and creative tasks to generative AI. This appeal is also visible in enterprise settings, with GenAI usage nearly doubling in one year in large enterprises and already reaching more than 70% of the workforce. If CEOs take the time to look back at the early waves of digital technologies, they will also draw a parallel between LLM and Internet adoption. They would look back at early Internet deployments and major technical improvements: Internet access went from dial-up at roughly 50 Mb per hour in 1995 to 1 Gb per second 25 years later (more than 1,000 times faster), creating a unique General Purpose Technology platform that has brought the FAANGs, social media, platforms and ecosystems, and a major softwarisation of our economies.

2. LLMs so far

The journey of LLMs such as OpenAI's GPT, Llama, or Anthropic's Claude began with foundational models that leveraged transformer architectures to process and generate human-like text. On top came innovations in data volume and hardware.

Unlike earlier models trained on single-domain text (e.g., news or Wikipedia), modern LLMs utilize heterogeneous sources such as Common Crawl, GitHub, Wikipedia, and forums.

The largest public text datasets, such as RedPajama or RefinedWeb, now contain tens of trillions of words collected from web pages. Finally, the exponential growth in computing requirements has driven innovation, particularly in hardware such as GPUs, TPUs and NPUs from players such as Nvidia. While training GPT-3 on a single NVIDIA Tesla V100 GPU would take more than three centuries, a cluster of 1,024 A100s would now do it roughly 3,600 times faster (a back-of-the-envelope check follows below).
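
As a rough sanity check of that speed-up, here is a minimal back-of-the-envelope calculation. The ~355 V100-years figure is a commonly cited external estimate for GPT-3 (consistent with "more than 3 centuries"), not a number from this article:

```python
# Back-of-the-envelope check of the training-time speed-up above.
# Assumption: ~355 GPU-years to train GPT-3 on a single V100
# (a commonly cited estimate; "more than 3 centuries" in the text).
single_v100_years = 355
speedup = 3600  # 1,024 x A100 cluster vs a single V100, per the text

days_on_cluster = single_v100_years * 365 / speedup
print(f"~{days_on_cluster:.0f} days on the A100 cluster")  # ~36 days
```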

Table 1 – Case examples of LLM-based productivity gains

| Use Case | Industry | LLM Used | Performance in Productivity |
|---|---|---|---|
| Code Completion | Software Development | GitHub Copilot | Increased coding speed by 55% and reduced cognitive load for developers |
| Enterprise Information Tasks | Various Enterprises | Microsoft Copilot | Boosted task execution speed by 30% without quality loss |
| Wireless Communication System | Telecommunications | Custom LLM | Achieved 40% productivity gains in code refactoring and validation |
| Marketing Campaigns | Marketing | GPT-4 | Reduced content creation time by 50% and increased personalization |
| Customer Service Automation | Retail | ChatGPT | Improved response times by 60% and customer satisfaction by 20% |
| Clinical Diagnosis Assistance | Healthcare | Custom LLM | Reduced diagnostic errors by 20% and shortened patient waiting times by 30% |
| Cybersecurity Regulation Mapping | Financial Services | Custom LLM | Reduced task completion time from months to days, achieving a 90% time reduction |
| Claims Processing | Insurance | Custom LLM | Expedited processing by 70% and enhanced customer experience |
| Product Review Analysis | E-commerce | Custom LLM | Enabled data-driven product development and marketing decisions, increasing efficiency by 35% |
| Innovation and Idea Generation | Various Industries | GPT-4 | Accelerated innovation processes by 40% and improved idea generation |

Source: Disguised cases; arXiv; McKinsey; Accenture; Alto, V. (2023). Modern Generative AI with ChatGPT and OpenAI Models: Leverage the capabilities of OpenAI's LLM for productivity and innovation with GPT3 and GPT4. Packt Publishing Ltd; Coutinho, M., Marques, L., Santos, A., Dahia, M., França, C., & de Souza Santos, R. (2024, July). The role of generative AI in software development productivity: A pilot case study. In Proceedings of the 1st ACM International Conference on AI-Powered Software (pp. 131-138); Reshmi, L. B., Vipin Raj, R., & Balasubramaniam, S. (2024). Generative AI and LLM: Case study in finance. In Generative AI and LLMs: Natural Language Processing and Generative Adversarial Networks, 231; Bughin, J. (2024). The role of firm AI capabilities in generative AI-pair coding. Journal of Decision Systems, 1-22; Cambon, A., Hecht, B., Edelman, B., Ngwe, D., Jaffe, S., Heger, A., … & Teevan, J. (2023). Early LLM-based Tools for Enterprise Information Workers Likely Provide Meaningful Boosts to Productivity. Microsoft Research, MSR-TR-2023-43.

However, as the adoption of LLMs spreads rapidly across industries, challenges to their use are mounting. Much of this involves the need for responsible AI: large datasets can contain sensitive and private information, and LLMs trained as next-word predictors sometimes produce incorrect or fabricated information ("hallucinations"), as their probabilistic nature makes it hard to verify outputs or retract errors.

Finally, models are not easily understood and lack causal grounding, raising suspicions about their reliability, fairness and scalability. In cybercrime, models such as FraudGPT and WormGPT are being developed to simulate cyber-attacks. All of this is being addressed by regulatory and more stringent cybersecurity frameworks, as well as by techniques that reverse-engineer model outputs, such as SHAP methods and others.

Another real danger is the economic and value sustainability of LLMs. Training data is inherently static, leading to rapidly outdated knowledge, and adapting LLMs to new tasks often requires resource-intensive fine-tuning. More recently, scaling models to trillions of parameters has shown diminishing returns, with marginal performance improvements coming at exponential increases in computational cost, and the fear that high-quality text data may soon run out [1].

There have been some clear warnings, such as the recent statement by Ilya Sutskever, co-founder of OpenAI and of SSI, who claimed that the "bigger (data, model) is better" mantra, followed with massive success by the LLM market in recent years, is likely no longer viable, due to diminishing returns to scale and the limitations of data from the public web.

3. LLM Breakthrough Innovations

It is easy to see why LLMs might plateau when their performance has been tied to massive data, and the supply of that data is growing an order of magnitude more slowly than the scale-up of new LLM versions.

But does this mean that LLMs will die? First, it suggests that in a situation of scarce resources, other sources of data can become extremely valuable, such as those embedded in Facebook, Instagram, or Google. Second, it makes clear that the exploitation of data multimodalities may become attractive. In general, history is a good guide to how industries turn to innovation to solve bottlenecks and expand. LLMs are likely to be no exception.

A first example is the automobile industry, which experienced an explosion of demand in the early 20th century, driven by Henry Ford's introduction of the assembly line and mass production. But bottlenecks soon became apparent, from high production costs and inconsistent manufacturing processes to unpaved and inadequate road infrastructure. Finally, petrol availability and engine efficiency were major concerns.

Systematic innovation helped to transform the nascent industry into what it is today. Innovations such as the moving assembly line drastically reduced production costs, making cars affordable. Governments invested in building motorways and roads to support car use. Seatbelts, airbags, ABS and crash testing standards were developed, increasing confidence in vehicles, and modern vehicles now incorporate AI, sensors and IoT, turning them into intelligent systems.

Another example is the pharmaceutical industry. In the early 20th century, groundbreaking drug discoveries (such as penicillin) led to rapid expansion of the industry, but scaling up drug production to meet growing demand posed manufacturing hurdles, while developing and testing new drugs required immense investment and could take decades.

Systematic innovations included advances in computational modelling and chemistry that enabled targeted drug development, reducing the reliance on trial and error. Techniques such as fermentation for antibiotics and synthetic production methods scaled up manufacturing, while biotechnology revolutionised treatments with the development of monoclonal antibodies, gene therapies and mRNA vaccines.

4. What is in the current bag for LLMs 

LLMs are also on the verge of major innovations (Table 2). We have already highlighted a set of key innovations in the making for the future of LLMs in a recent article in this Review. We also presented the evolution towards hybrid AI systems such as neurosymbolic AI, which combines symbolic reasoning with LLMs to improve cost and interpretability. Finally, we flagged the rise of multimodal models, and the attractive path of Liquid Foundation Models (LFMs), which bypass the transformer architecture to achieve strong performance across various scales while maintaining smaller memory footprints and more efficient inference.

Here we add a critical set of five more recent innovations:

  1. Retrieval-Augmented Generation (RAG). It is often mentioned that 40–65% of AI-driven decision-making processes fail because the data is either too old or inadequate. RAG bridges the gap between static training data and dynamic, real-time knowledge. By retrieving information from external sources like databases and the web, RAG ensures responses are accurate and up-to-date. It pairs this retrieval with the generative capabilities of LLMs, offering tailored and trustworthy answers. In tasks like diagnosis support, RAG can boost precision by up to 30% over traditional LLMs, as it directly integrates real-world data retrieval.
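
To illustrate the mechanics, here is a minimal RAG sketch. The toy corpus, the TF-IDF retriever, and the prompt format are all assumptions for illustration; production systems typically use vector databases and learned embeddings, and the final generation step can go to any LLM endpoint:

```python
# Minimal RAG sketch (illustrative only): retrieve the passages most
# relevant to a query, then ground the LLM prompt in them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy, invented knowledge base standing in for a live external source.
corpus = [
    "Policy A covers water damage up to 10,000 EUR.",
    "Policy B excludes flood damage in coastal zones.",
    "Claims must be filed within 30 days of the incident.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(query: str) -> str:
    """Augment the user question with retrieved, up-to-date context."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Is flood damage covered?"))
# The resulting prompt would then be sent to any LLM completion endpoint.
```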

Table 2 – Innovation parallelism

| | Automobile Industry | Pharmaceutical Industry | Large Language Models (LLMs) |
|---|---|---|---|
| Initial Bottlenecks | High costs, poor infrastructure, safety concerns, fuel | Discovery inefficiencies, production challenges, safety | Computational costs, bias, ethical concerns, scalability |
| Growth Challenges | Scaling production, building roads, environmental issues | Long R&D timelines, regulatory delays, global access | Understanding context, managing biases, reducing misuse |
| Key Innovations | Assembly lines, safety features, IoT-enabled vehicles | Rational drug design, mRNA technology, mass production | Parameter-efficient models, multimodal systems, fine-tuning |
| Scaling Efficiency | Moving assembly line, synthetic materials, mass adoption | Synthetic chemistry, fermentation, generics | Efficient training algorithms, model distillation |
| Safety and Trust | Crash tests, airbags, safety regulations | Clinical trials, regulatory frameworks (e.g., FDA) | AI alignment, ethical AI guidelines, transparency |
| Infrastructure | Highways, fuel stations, urban redesign | Global supply chains, public-private partnerships | Cloud infrastructure, edge computing |
| Global Access | Mass production reduced costs | Generic drugs, global access programs (e.g., Gavi) | Open-source models, localization efforts |

Source: Aggeri, F., Elmquist, M., & Pohl, H. (2009). Managing learning in the automotive industry – the innovation race for electric vehicles. International Journal of Automotive Technology and Management, 9(2), 123-147; Shimokawa, K. (2010). Japan and the Global Automotive Industry. Cambridge University Press; Schlie, E., & Yip, G. (2000). Regional follows global: Strategy mixes in the world automotive industry. European Management Journal, 18(4), 343-354; Cusumano, M. A. (1988). Manufacturing innovation: lessons from the Japanese auto industry. MIT Sloan Management Review; Malerba, F., & Orsenigo, L. (2001). Towards a history friendly model of innovation, market structure and regulation in the dynamics of the pharmaceutical industry: the age of random screening. Roma, Italy: CESPRI – Centro Studi sui Processi di Internazionalizzazione; Kean, M. A. (2004). Biotech and the pharmaceutical industry: Back to the future. The OECD Observer, (243), 21.

  2. In-Context Learning (ICL). ICL eliminates the need for fine-tuning by providing examples within the prompt. Innovations like many-shot ICL have improved model performance on complex tasks by increasing the number and quality of examples [2].
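
A minimal sketch of the idea, with an invented sentiment-labelling task; the examples are assumptions for illustration, and many-shot ICL simply scales the same pattern to hundreds of examples:

```python
# Few-shot in-context learning sketch (illustrative): the model is
# steered by labeled examples placed in the prompt instead of being
# fine-tuned. Task and examples are invented for illustration.
EXAMPLES = [
    ("The delivery was late and the box was damaged.", "negative"),
    ("Great value, arrived a day early.", "positive"),
    ("Works as described, nothing special.", "neutral"),
]

def few_shot_prompt(text: str) -> str:
    """Prepend labeled examples so the LLM infers the task in context."""
    shots = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in EXAMPLES)
    return f"{shots}\nReview: {text}\nSentiment:"

# Many-shot ICL scales EXAMPLES to hundreds of (input, label) pairs,
# which recent work reports improves performance on complex tasks.
print(few_shot_prompt("The battery died after two weeks."))
```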

  3. Hallucination Detection and Mitigation. Techniques like SAPLMA (Statement Accuracy Prediction based on Language Model Activations) analyze internal model activations to detect inaccuracies. These methods enable real-time error correction, reducing reliance on human validation.
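
A heavily simplified sketch of the approach: SAPLMA trains a small classifier on hidden-layer activations to predict whether a generated statement is accurate. Here, random vectors stand in for real activations and a linear probe stands in for SAPLMA's feed-forward classifier:

```python
# SAPLMA-style hallucination detection sketch (illustrative only).
# Assumption: random vectors stand in for hidden-layer activations;
# in practice they come from a chosen layer of the LLM, with labels
# from statements whose truth is known.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 768))      # stand-in activations
labels = rng.integers(0, 2, size=200)           # 1 = accurate, 0 = fabricated

# A linear probe stands in for SAPLMA's small feed-forward classifier.
probe = LogisticRegression(max_iter=1000).fit(activations, labels)

def looks_accurate(activation: np.ndarray) -> float:
    """Probability that the statement behind this activation is accurate."""
    return float(probe.predict_proba(activation.reshape(1, -1))[0, 1])

print(looks_accurate(rng.normal(size=768)))
```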

  4. Test-Time Compute. This paradigm shifts computational focus from the training phase to inference. By allocating more resources during task execution, test-time compute enables deeper reasoning and more accurate responses without escalating training costs. Studies in poker AI research show that giving a model 20 seconds to think can improve decision-making performance as much as scaling the model up by 100,000x. Other studies demonstrate similarly large improvements, showing that test-time compute can let a model outperform one 14x larger. This is particularly useful in real-time applications where better output quality is desired but must be balanced with cost efficiency.
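
One common way to spend extra inference compute is best-of-N sampling against a verifier, sketched below; `sample_answer` and `score` are placeholders for a stochastic LLM call and a verifier or reward model, not any specific API:

```python
# Best-of-N test-time compute sketch (illustrative): sample several
# candidate answers and keep the one a scoring function prefers.
# More samples = more inference compute = better expected answer.
import random

def sample_answer(question: str) -> str:
    """Placeholder for one stochastic LLM completion."""
    return f"candidate answer {random.randint(0, 9)} to: {question}"

def score(question: str, answer: str) -> float:
    """Placeholder for a verifier or reward model."""
    return random.random()

def best_of_n(question: str, n: int = 8) -> str:
    """Spend n LLM calls at inference time; return the best-scoring one."""
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda a: score(question, a))

print(best_of_n("What is the optimal bet size here?"))
```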

  5. LLM2Vec. LLM2Vec refines language representation by introducing bidirectional attention and advanced training techniques like Masked Next Token Prediction (MNTP) and SimCSE. These methods enhance the model's ability to understand and represent text for clustering, semantic search, and context-sensitive applications. The ability to process all context in both directions significantly enhances performance for tasks such as content summarization or text classification. With models like Mistral-7B using LLM2Vec, accuracy on semantic search tasks has reached over 90%, with efficiency gains of 25-30% over traditional unidirectional models.
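
To make the semantic-search use case concrete, here is a minimal sketch; `embed` is a placeholder for an LLM2Vec-style bidirectional text encoder (e.g., a converted Mistral-7B), not the actual library API:

```python
# Semantic-search sketch (illustrative) over LLM2Vec-style embeddings.
# Assumption: `embed` stands in for a decoder LLM converted into a
# bidirectional text encoder via MNTP and SimCSE training.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a unit-norm text embedding."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

docs = [
    "GDPR data retention rules",
    "Q3 revenue forecast",
    "Employee onboarding checklist",
]
doc_vecs = np.stack([embed(d) for d in docs])

def search(query: str) -> str:
    """Return the document whose embedding is closest to the query's."""
    sims = doc_vecs @ embed(query)  # cosine similarity on unit vectors
    return docs[int(np.argmax(sims))]

print(search("How long can we keep customer data?"))
```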

5. Will the new LLM innovations pay off?

The above clearly suggests that the future of LLMs is far from bleak (Table 3). In terms of cost efficiency, innovations such as quantization and now test-time compute are critical to reducing computational costs, especially for large enterprises or cloud-based solutions.

Techniques such as RAG, LLM2Vec and many-shot ICL significantly improve the accuracy of any LLM, especially for specialised or task-oriented applications. This makes them ideal for industries that require high accuracy, such as healthcare, legal or finance. Also, retrieval-based systems (RAG) and reinforcement learning from human feedback (RLHF) can greatly reduce hallucination problems by grounding responses in real-world, verified data or human feedback.

Table 3 – Impact examples of new LLM innovations

| Innovation | Case | Efficiency | Accuracy |
|---|---|---|---|
| RAG | Healthcare knowledge assistance (e.g., ElasticSearch + GPT) | Reduced time to information retrieval by 30%, improving staff efficiency | Improved diagnosis support with 30% precision improvement |
| LLM2Vec | Legal document analysis using Mistral-7B + LLM2Vec | 25% more efficient compared to standard causal models | Achieved near 90% accuracy for document relevance |
| Test-Time Compute | Streaming services (dynamic subtitle generation) | Dynamic scaling reduced compute costs by 30% | Improved accuracy on complex tasks with more time to process |
| ICL | Real-time essay feedback systems in education (e.g., GPT-3 with few-shot prompts) | 15-25% more efficient by reducing retraining needs | Performance improved for complex tasks with 5-10 examples per prompt |

Source: BehnamGhader, P., Adlakha, V., Mosbach, M., Bahdanau, D., Chapados, N., & Reddy, S. (2024). LLM2Vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961; Wang, F., Lin, C., Cao, Y., & Kang, Y. (2024). Benchmarking general purpose in-context learning. arXiv preprint arXiv:2405.17234; Xu, P., Ping, W., Wu, X., Xu, C., Liu, Z., Shoeybi, M., & Catanzaro, B. (2024). ChatQA 2: Bridging the gap to proprietary LLMs in long context and RAG capabilities. arXiv preprint arXiv:2407.14482; Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314; Yue, Z., Zhuang, H., Bai, A., Hui, K., Jagerman, R., Zeng, H., … & Bendersky, M. (2024). Inference scaling for long-context retrieval augmented generation. arXiv preprint arXiv:2410.04343.

Over time, therefore, companies that exploit the innovation potential of the above for LLMs/LFMs can be expected to improve significantly as more of these innovations (e.g., RAG, LLM2Vec, sparse models) are rolled out and integrated.

In the short term, innovations such as Test-Time Compute and Sparse Models provide significant improvements in cost efficiency, with savings of up to 40-50%. Performance improvements are also noticeable, with a 5-10% increase in accuracy. These early-stage innovations enable organisations to perform common tasks more cost-effectively.

In the medium term, we are likely to see a stronger push from innovations such as LLM2Vec and many-shot in-context learning, which improve model performance on complex tasks, before companies combine advanced sparse models, adaptive retrieval systems and advanced in-context learning to lift LLM performance further still.

In fact, using benchmarks of how these innovations will be used, and the synergies between them, we have computed an expected 25-50% gain in cost per query, a 15-30% improvement in accuracy, and a 30-45% gain in data usage, so that the cost per accurate answer should fall by 16-30% per year over the next five years, a substantial cumulative drop (see the illustration below).
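
To see what a 16-30% annual drop compounds to over five years, a minimal illustration using only the figures above (not a forecast for any specific model or vendor):

```python
# Compounding illustration of the estimated 16-30% annual drop in the
# cost per accurate answer, over a five-year horizon.
for annual_drop in (0.16, 0.30):
    remaining = (1 - annual_drop) ** 5
    print(f"{annual_drop:.0%}/yr -> {1 - remaining:.0%} total drop in 5 years")
# 16%/yr compounds to ~58%; 30%/yr compounds to ~83%.
```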

So, while LLMs may show only moderate performance gains in the short term (one year), performance should more than double over the next three years and be up to six times better in five years, with exponential gains. For businesses, this also means that in the short term LLMs will be best suited to tasks that require less computation, but in the medium and long term the economics and performance of LLMs will make them accessible for a much wider range of tasks, with a strong ROI.

6. The CEO journey ahead

Generative AI has truly exploded in the last two years, following the breakthrough of transformers as the basis for LLM platforms. In turn, the exploitation of massive amounts of public data for training has set a unique performance uplift trajectory, which, however, now looks likely to plateau, with public text data entering a period of scarcity in the next five years.

The messages for the astute executive are thus relatively clear:

  1. Be AI-ready. The AI journey is not over – with advances such as RAG, ICL, SAPLMA, test-time compute and LLM2Vec, the next generation of LLMs or liquid ones promise to be more efficient, reliable and context-aware. As the field evolves, the balance between effectiveness, cost and accuracy will shape the role of AI in business, but the quality/cost/performance triad will continue to tilt towards mass customisation of AI.
  2. Scale AI. However, the potential of AI improvements depends on embedding all these current innovations across all business systems to achieve significant gains. This includes embedding AI in customer service, sales, marketing, operations and G&A. There is no place to hide.
  3. Your data is your AI. In addition to innovative AI-based LLMs, complementary data will be key. While training data and algorithm techniques keep improving every year, data scarcity looms, forcing companies to use their own private data outside the indexed web. Encouraging dialogue to generate a variety of data from LLMs (synthetic data) and from customer dialogue and insights will also be key. This data, when integrated, provides uniquely private and high-quality data that may be needed to drive innovation in LLMs. As such, this balance of privacy and high quality is the bargaining power of any CEO's organisation against a future driven solely by the LLM. The best companies will be those that are both AI-savvy and the best orchestrators of their closed knowledge and privacy on the one hand, and the need for GenAI automation on the other, for the ethical and bright future of their business.

About the Author

Jacques Bughin is the CEO of MachaonAdvisory and a former professor of management. He retired from McKinsey as a senior partner and director of the McKinsey Global Institute. He advises Antler and Fortino Capital, two major VC/PE firms, and serves on the boards of several companies.

References

[1] The year 2032 is the median year by which the stock of high-quality public text data is expected to be exhausted. For computer vision, however, data will be exhausted at a much slower rate, around 2045 on average, partly because the data size of computer vision models has so far grown three times more slowly than that of text models. It remains to be seen whether this will shift as computer vision becomes less complex and less expensive.

[2] Note that the technique must optimise example ordering for maximum efficiency; in problem-solving tasks, for example, presenting the examples in the most logical order helps the model perform better.
