Generative AI has already made a big splash, but just how wet we’re all going to get in the future is still a matter for speculation. Nevertheless, as you reach for your towel, Jacques Bughin has a few solid predictions for you.
Large language models (LLMs), exemplified by OpenAI’s Generative Pre-trained Transformer 4 (GPT-4), defined the year 2023, ushering in a new revolution in AI and machine learning. While traditional machine learning models often rely on single-source data sets, LLMs, built on transformer neural network architectures, are trained on an unprecedented scale of compute and data. The result is an impressive set of capabilities, encompassing tasks that were once exclusive to humans, such as reasoning, abstraction, and projection.
As we explore the potential of these models, it’s essential to contemplate the future. Here are five predictions.
LLMs will become (even) more skilled
In less than a year, LLMs have delivered an impressive evolution, expanding from text-based generative AI to incorporating voice and vision in models like GPT-4. Looking back, the original GPT of 2018 could not reliably produce coherent text, and GPT-2, released a few months later, could follow only simple instructions. GPT-3 and now GPT-4 can perform a wide range of language tasks on a par with humans.
Other models, such as Google’s Gemini, TII’s Falcon, and Anthropic’s Claude, have also been enhancing their performance to compete with OpenAI’s products.
Over the past five years, LLMs have, on average, improved their accuracy on the Massive Multitask Language Understanding (MMLU) benchmark, reaching human expert-level accuracy[1]. This performance has scaled closely with computational resources. Larger models, such as GPT-3, showcase significantly enhanced capabilities compared with their predecessors, leveraging approximately 20,000 times more data, computation and parameters.
Another reason LLMs are chasing size is that they have demonstrated massive bursts in abilities such as programming and arithmetic once models pass a certain scale threshold. In general, performance improves roughly gradually and predictably with scale when the task rests on knowledge or memorisation, but it can exhibit “breakthrough” behaviour when the task requires much more complex reasoning. Those breakthroughs are only starting to be seen now, and the race for performance among large LLM providers competing for share, from Google to Microsoft and others, suggests that performance will continue to accelerate in the months to come.
Performance is “more with less”
Based on estimates by DeepMind and Meta, the performance function increases by a factor of 10 for a model 10 times larger in both parameters and tokens. This also means that the scale needed to sustain improvements is challenging, and that “big is beautiful” may need a twist going forward. On this path, building LLMs is computationally costly and inevitably raises sustainability issues, not to mention that LLMs will soon have consumed all the public data available to train and tailor them.
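To make the scaling logic concrete, here is a sketch based on the compute-optimal analysis of Hoffmann et al. (2022), cited below; the fitted constants are theirs, while the worked numbers are illustrative:

```latex
% Parametric training loss for a model with N parameters trained on
% D tokens (Hoffmann et al., 2022):
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\ A \approx 406.4,\ B \approx 410.7,\
\alpha \approx 0.34,\ \beta \approx 0.28.
```

Scaling both N and D by 10 shrinks the two reducible loss terms by factors of roughly 2.2 and 1.9 respectively (10^0.34 and 10^0.28), steady gains bought at roughly 100 times the training compute, which scales approximately as 6ND.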
How, then, will performance increase under a “more with less” paradigm shift? We already see some segmentation happening between mid-sized models (those with around 10 billion parameters, such as Llama, Alpaca and Falcon) and the very big ones with hundreds of billions of parameters. The roughly 10-billion-parameter models are, more often than not, open-source, but they are also experimenting with new methods to break away from high computing costs.
Those models are looking at new ways around data, such as data repetition, data augmentation and synthetic data, on top of merging generative AI with reinforcement learning. Repeated data helps an LLM learn content structure and patterns more effectively, but there is clearly a risk that duplicate data may overwhelm the training process and degrade the model’s capabilities. The challenge will thus be to find the right balance between learning and degradation.
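To make the trade-off concrete, here is a minimal Python sketch of capping data repetition when assembling a training mix; the helper and the cap value are hypothetical, not drawn from any cited paper:

```python
# A minimal sketch of bounding data repetition: each document may be
# repeated up to `max_epochs` times, so the model sees scarce data more
# than once without letting duplicates dominate training.
from collections import Counter

def build_training_mix(documents: list[str], max_epochs: int = 4) -> list[str]:
    """Repeat each unique document at most `max_epochs` times."""
    counts = Counter()
    mix = []
    for doc in documents:
        if counts[doc] < max_epochs:
            counts[doc] += 1
            mix.append(doc)
    return mix

# Example: a duplicate-heavy raw corpus is thinned to a bounded mix.
raw = ["boilerplate page"] * 100 + ["rare domain text"] * 2
print(len(build_training_mix(raw)))  # 6 = 4 capped repeats + 2 rare docs
```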
Reinforcement learning techniques have proven extremely successful in the past, from Tesauro’s system of 1995, which learned to play backgammon at a very strong master level, to Crites and Barto’s (1996) techniques for optimally dispatching elevators in a multi-storey building. They are quickly being adopted for refining LLMs and supporting their accuracy. The customisation of GPT is one trick to make people more likely to use LLMs and provide reinforcement signals; even if the performance of today’s models is not necessarily better for it, in the medium term models are sure to learn extensively from the feedback loops of intensive and diversified usage.
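For intuition on how such feedback loops work, here is a toy sketch of a Bradley-Terry-style reward model learned from pairwise preferences, the core signal behind reinforcement learning from human feedback; the feature vectors and data are invented for illustration, and production pipelines are far more involved:

```python
# Toy reward model learned from pairwise preferences (RLHF-style).
import numpy as np

rng = np.random.default_rng(0)

# Invented 3-dimensional features for pairs of candidate answers; in
# each pair, the "chosen" answer was preferred over the "rejected" one.
chosen = rng.normal(1.0, 1.0, size=(200, 3))    # preferred answers
rejected = rng.normal(0.0, 1.0, size=(200, 3))  # rejected answers

w = np.zeros(3)  # linear reward: r(x) = w . x
lr = 0.1
for _ in range(500):
    # Bradley-Terry: P(chosen beats rejected) = sigmoid(r_c - r_r)
    margin = (chosen - rejected) @ w
    p = 1.0 / (1.0 + np.exp(-margin))
    # Gradient ascent on the log-likelihood of the observed preferences
    w += lr * ((1.0 - p) @ (chosen - rejected)) / len(chosen)

print("learned reward weights:", w.round(2))
```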
A promising avenue is synthetic data, which has already proven its worth outside LLMs in many circumstances. A publicly known case is Nvidia’s simulator application in its industrial metaverse, which successfully leverages synthetic data to train robots. More generally, synthetic data can be used to rebalance samples when the required prediction concerns rare events such as financial fraud or manufacturing defects: generating synthetic instances of such events increases model accuracy. Watch this space, but “more with less” is here to stay and expand.
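To illustrate the rare-event rebalancing point, here is a minimal SMOTE-style oversampling sketch in NumPy; the “fraud” data is invented and the interpolation rule is the textbook one, not a production fraud system:

```python
# Minimal SMOTE-style oversampling: create synthetic minority-class
# examples (e.g., fraud cases) by interpolating between real ones.
import numpy as np

rng = np.random.default_rng(42)

def oversample_minority(X_min: np.ndarray, n_new: int) -> np.ndarray:
    """Interpolate between random pairs of minority samples."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    t = rng.random((n_new, 1))  # interpolation weight per new sample
    return X_min[i] + t * (X_min[j] - X_min[i])

# Suppose only 10 "fraud" cases exist among thousands of transactions.
fraud = rng.normal(loc=3.0, scale=0.5, size=(10, 4))  # invented data
synthetic_fraud = oversample_minority(fraud, n_new=490)
print(synthetic_fraud.shape)  # (490, 4): a far better balanced class
```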
Cracking xAI for LLMs
Explainable AI, or xAI, is the capacity of AI models to elucidate the reasoning behind their actions in a manner comprehensible to humans. This transparency is paramount for user trust, particularly in safety-critical domains like healthcare, self-driving cars, and finance. AI providers, for their part, need to understand the flaws and biases in their models in order to enhance accuracy.
xAI has seen substantial progress outside LLMs, employing attribution techniques like gradient-based methods and SHAP values. For LLMs, however, the challenges are twofold. First, these techniques demand considerable computational power when applied to billion-parameter models. Second, the sheer volume of content absorbed by LLMs surpasses human capacity for absorption, a problem compounded by the high non-linearity of neural networks. This is not a “black box” issue for users only, but for designers themselves. Remember that the model’s improvements could not be anticipated at the release of GPT-3; it took some time for OpenAI, among others, to realise GPT-3’s breakthrough capabilities of few-shot learning and chain-of-thought reasoning.
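As a flavour of what attribution looks like outside LLMs, here is a minimal sketch using the open-source shap package on a small tree model; the data is a toy example, not anyone’s production pipeline:

```python
# Attribution with SHAP values on a small tabular model: each feature
# gets a signed contribution to an individual prediction.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # toy features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy label rule

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer computes exact SHAP values efficiently for tree models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])

# Depending on the shap version this is a list (one array per class) or
# a single array; either way, entries are per-feature contributions.
print(np.asarray(shap_values).shape)
```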
There is currently no robust technique providing a clear logic of how LLMs operate, owing to the complexity of dealing with billions of connections between artificial neurons. Despite claims, recent research by OpenAI suggests that, at the level of individual neurons, only 2 per cent exhibit explanations scoring above 70 per cent.
The field of xAI is flourishing out of necessity, for it is key to LLM survival. Various tools, including geometry-based approaches, aim to make LLM models more transparent. Alternative solutions involve exploring techniques beyond pure neural-network-based models, such as merging with symbolic AI or employing synthetic data.
LLMs go to the edge
Users have often experienced latency issues with products like ChatGPT, and concerns about the privacy and security of data ingested by LLMs have arisen due to their reliance on cloud computing. Edge AI presents a potential solution, offering connected intelligence with low latency, as well as privacy-preserving AI services at the network edge.
A virtuous cycle emerges where LLMs optimise last-mile network consumption, benefiting telecom providers through improved AI-based scheduling and resource allocation algorithms. This results in decreased latency and enhanced spectral efficiency.
Monetary considerations come into play as well. The value of LLMs lies in their application, not just in their training sets. This suggests a shift towards smaller, versatile LLMs for simpler applications, such as human interactions and conversations. Techniques like parameter-efficient fine-tuning and QLoRA can tune smaller models to perform like state-of-the-art LLMs.
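As a flavour of how this works in practice, here is a minimal QLoRA-style setup using the open-source Hugging Face transformers and peft libraries; the base model name is just an example, the hyperparameters are illustrative, and the dataset and training loop are omitted:

```python
# Minimal QLoRA-style setup: load a base model quantised to 4 bits,
# then attach small trainable low-rank adapters (LoRA) so only a tiny
# fraction of the parameters is updated during fine-tuning.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights (NF4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

# Example base model; any causal LM from the Hub could stand in here.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,  # illustrative values
    target_modules=["q_proj", "v_proj"],    # adapters on attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable
```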
This trend not only secures the demand side but also sees consumer tech players finding value in making their devices more intelligent, rather than relying solely on cloud providers. The intersection of 5G and LLMs is expected to create a small revolution, with companies like Apple, Samsung, and others actively working on it.
Implications of this trend include the potential for LLMs to reside at the edge, such as on smartphones, enabling real-time conversational activities like image collection and chatbot discussions. In business contexts, tailored LLMs facilitate intelligent and private access to company-specific information, supporting the vision of widespread mobile and remote work. Additionally, robots with LLM capabilities could collaborate with humans in real time.
LLMs will kick off the age of AI transformation
The hype aside, 2024 appears poised to witness the transition from testing to real competitive use of generative AI by companies. Presently, many business users engage ChatGPT for programming and marketing support.
However, the true industrialisation of LLM-based transformation is a gradual process. Several factors contribute to this, the absence of established usage policies being a prominent one. Concerns about security, hallucination, and other issues are valid, yet the underlying reality is that enterprises possess a valuable reserve of private data. This data can fuel the development of domain-specific models, enabling the integration of AI as a strategic advantage within the enterprise.
Major tech entities like Microsoft and Google are evidently targeting the enterprise markets, leveraging their wealth of data. Simultaneously, other players are entering the arena with new models as a service tailored for the enterprise space. Singapore-based WIZ.AI’s LLMs exemplify this trend, seizing the enterprise AI opportunity.
The evolving value chain is swiftly organising around vertical application providers, some integrated with LLMs. Examples include Claude’s emphasis on education, FinGPT specialising in finance, Google’s focus on healthcare, Dramatron contributing to theatres and movie scripting, or Codex powering GitHub Copilot for software development.
The era of digital transformation is now a part of history, making way for the era of AI transformation. The signs are that 2024 will be the year, with companies positioning themselves to harness the potential of generative AI, recognising it as a pivotal element in reshaping their business landscape.
About the Author
Jacques Bughin is CEO of MachaonAdvisory and a former professor of management who retired from McKinsey as senior partner and director of the McKinsey Global Institute. He advises Antler and Fortino Capital, two major VC/PE firms, and serves on the board of a number of companies.
References
- [1] Hendrycks and colleagues (2021) created the test from multiple-choice questions spanning 57 subjects, from the humanities to the mathematical sciences, collected from a wide array of practice questions for tests such as the Graduate Record Examination, the United States Medical Licensing Examination, and the Examination for Professional Practice in Psychology. According to the authors, human-level accuracy on this test varies. They report that “unspecialised humans from Amazon Mechanical Turk obtain 34.5 per cent accuracy on this test. (…) real-world test-taker human accuracy at the 95th percentile is around 87 per cent for US Medical Licensing Examinations”.
- Balestriero, Randall, Romain Cosentino, and Sarath Shekkizhar. “Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation.” arXiv preprint arXiv:2312.01648 (2023).
- Bughin, Jacques (2023, October 1). “To ChatGPT or not to ChatGPT: A note to marketing executives”. In Applied Marketing Analytics: The Peer-Reviewed Journal, Volume 9, Issue 2.
- Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de las Casas, D., Hendricks, L. A., Welbl, J., et al. (2022). “An empirical analysis of compute-optimal large language model training”. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.
- Huang, C., Li, S., Liu, R., Wang, H., and Chen, Y. (2023). “Large Foundation Models for Power Systems.” arXiv preprint, December 2023.
- Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). “QLoRA: Efficient Finetuning of Quantized LLMs.” arXiv preprint arXiv:2305.14314, May 2023.
- Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2020). “Measuring massive multitask language understanding.” arXiv preprint arXiv:2009.03300.
- Hoffmann, J., S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” vol. abs/2203.15556, 2022.
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). “Scaling laws for neural language models.” CoRR, vol. abs/2001.08361.
- Kashefi, Rojina, et al. “Explainability of Vision Transformers: A Comprehensive Review and New Perspectives.” arXiv preprint arXiv:2311.06786 (2023).
- Letaief, Khaled B., et al. “Edge artificial intelligence for 6G: Vision, enabling technologies, and applications.” IEEE Journal on Selected Areas in Communications 40.1 (2021): 5-36.
- Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y. and Welleck, S. (2023). “Self-refine: Iterative refinement with self-feedback.” arXiv preprint arXiv:2303.17651.
- Mehra, Akshit (2023). “Data Collection and Preprocessing for Large Language Models,” June, Labellerr, accessed on 16 December.
- Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. (2022). “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.” arXiv preprint arXiv:2206.04615.
- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). “LLaMA: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971.
- Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., … & Du, M. (2023). “Explainability for large language models: A survey.” arXiv preprint arXiv:2309.01029.
- Zhao, Wayne Xin, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, et al. “A survey of large language models.” arXiv preprint arXiv:2303.18223 (2023).
- Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q. V., and Chi, E. H. (2023). “Least-to-most prompting enables complex reasoning in large language models.” In: The Eleventh International Conference on Learning Representations, 2023. [https://openreview.net/forum?id=WZH7099tgfM]