One of the biggest trends in natural language processing (NLP) in recent years has been the proliferation of large language models (LLMs) developed by the world’s leading technology firms, the most notable example being OpenAI’s ChatGPT. LLMs are a type of artificial intelligence (AI) algorithm that uses deep learning techniques and large datasets to predict and generate new content.[1] LLMs have demonstrated remarkable abilities in generating coherent conversation and text across domains such as the humanities, geopolitics, literature, physics, and computer science. GPT-4 has even passed medical, law, and business school exams, scoring in the 99th percentile of the 2020 USA Biology Olympiad Semifinal Exam.[2]
The release and popularity of OpenAI’s ChatGPT has jolted many of the world’s leading technology firms into developing their own LLMs. Google’s PaLM, Baidu’s Ernie Bot, and Meta’s LLaMA have all sought to compete with ChatGPT in the LLM space, raising fears of an AI arms race. While many developers train LLMs on text alone, some have begun training models on video and audio input as well, accelerating model development. With more sophisticated training protocols, newer LLM releases may markedly improve accuracy and help researchers bring new technologies to market faster, as demonstrated by autonomous driving sensing regimes[3] and promising drug discoveries[4].
Source: The Alan Turing Institute
However, the development of LLMs has also raised a number of risks and uncertainties, including algorithmic bias, misuse, and mismatches between production data and training data, known as “drift”. A particularly pressing concern has been reconciling an LLM’s training parameters with the authenticity and trustworthiness of its outputs, with the aim of minimizing “stochastic parrots” and hallucinations. In their seminal paper, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”, Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell defined a language model as a “system for haphazardly stitching together linguistic forms from its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.”[5] A stochastic parrot thus repeats its training data without understanding its outputs, while hallucinations occur when an LLM generates text from its internal logic, producing context-based predictions that may sound plausible but are factually inaccurate or contradictory.
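To make the notion of drift concrete, the sketch below flags a shift between a sample of training-era text and recent production traffic. It is a minimal illustration only: the word-frequency comparison, the example texts, and the 0.3 alert threshold are all assumptions, and production systems more commonly compare embedding distributions or use dedicated monitoring tools.

```python
# A minimal, illustrative sketch of detecting "drift" -- a growing mismatch
# between the data a model sees in production and the data it was trained on.
# Drift is approximated here by the Jensen-Shannon divergence between
# word-frequency distributions; the texts and threshold are assumptions.
import math
from collections import Counter

def word_distribution(texts):
    """Normalized word-frequency distribution over a list of texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(prob * math.log(prob / m[w]) for w, prob in a.items() if prob > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

training_sample = ["refund policy for electronics", "shipping times to Europe"]
production_sample = ["does the chatbot support crypto payments",
                     "crypto wallet integration help"]

score = js_divergence(word_distribution(training_sample),
                      word_distribution(production_sample))
print(f"drift score: {score:.3f}")
if score > 0.3:  # assumed alert threshold
    print("warning: production traffic may have drifted from training data")
```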
Understanding an LLM’s explainability, such as ChatGPT-4’s parameters and underlying logic, will therefore prove immensely helpful to businesses testing the relative creativity of LLMs against human innovation and ingenuity, giving AI implementers greater insight into the limits of training datasets.
An illustrative example of ChatGPT’s capabilities appears in a recent paper by Christian Terwiesch, a professor of Operations, Information and Decisions at the Mack Institute for Innovation Management at the University of Pennsylvania’s Wharton School, a WTW/WRN partner institution, which demonstrates the advantage LLMs hold over their human counterparts in idea generation.
To test the ideational potential of LLMs vis-à-vis humans, Terwiesch et al.[6] compared three pools of ideas for new consumer products: a first pool created by Wharton students enrolled in a product design course before LLMs became widely available; a second pool generated by ChatGPT-4 from the same prompt given to the students; and a third pool generated by ChatGPT-4 from that prompt plus a sample of highly rated ideas to enable in-context learning (ICL), a prompt engineering method in which demonstrations of the task are included in the prompt itself (see the sketch below). Given these pools, Terwiesch sought to determine how productive ChatGPT-4 was, how the quality of its ideas was distributed, how LLMs can be used effectively in practice, and what the implications are for innovation management.
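The sketch below illustrates the difference between the two ChatGPT-4 conditions using the openai Python client. The model name, prompt wording, and example ideas are illustrative placeholders, not the actual prompts or examples from the Terwiesch paper.

```python
# A minimal sketch of the two prompting conditions: a plain (zero-shot) prompt
# versus an in-context learning (few-shot) prompt that embeds demonstrations.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TASK = ("Generate an idea for a new consumer product for college students "
        "that would retail for less than $50.")

# Condition 2: the task alone, with no demonstrations.
zero_shot_messages = [{"role": "user", "content": TASK}]

# Condition 3: in-context learning -- the same task, preceded by examples of
# highly rated ideas included directly in the prompt (hypothetical examples).
GOOD_EXAMPLES = [
    "A collapsible laundry hamper that clips onto a dorm bed frame.",
    "A reusable notebook whose pages wipe clean after scanning to the cloud.",
]
few_shot_messages = [{
    "role": "user",
    "content": TASK + "\n\nHere are examples of highly rated ideas:\n"
               + "\n".join(f"- {ex}" for ex in GOOD_EXAMPLES),
}]

for label, messages in [("zero-shot", zero_shot_messages),
                        ("few-shot", few_shot_messages)]:
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    print(label, "->", response.choices[0].message.content)
```

The only change between the two conditions is the content of the prompt: ICL requires no retraining or fine-tuning, which is what makes it such a cheap lever for improving output quality.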
As a benchmark, Terwiesch used two hundred randomly selected ideas generated by Wharton’s 2021 class. ChatGPT-4 produced two hundred ideas in 15 minutes: one hundred from the plain prompt and another one hundred after being shown examples of good ideas. These numbers contrast sharply with human performance, as humans have been shown to generate about five ideas in 15 minutes[7], demonstrating ChatGPT-4’s productive superiority. To measure ideational quality, the researchers then used Amazon’s MTurk platform to evaluate all four hundred ideas, with each idea rated twenty times by users asked to express purchase intent. The average quality of the ideas generated by ChatGPT-4 was rated more highly than that of the human-generated ideas, while human ideas fared slightly better on novelty.
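A minimal sketch of the aggregation step in that evaluation design appears below: each idea receives twenty purchase-intent ratings, and pools are compared by the mean rating of their ideas. The 1-to-5 rating scale, pool labels, and sample records are assumptions for illustration, not data from the study.

```python
# Aggregating repeated purchase-intent ratings per idea, then per pool.
from collections import defaultdict
from statistics import mean

# Each record: (idea_id, pool, rating). Pools and the 1-5 scale are assumed.
ratings = [
    ("idea-001", "human", 3), ("idea-001", "human", 4),
    ("idea-002", "gpt4", 5), ("idea-002", "gpt4", 4),
    ("idea-003", "gpt4_icl", 5), ("idea-003", "gpt4_icl", 5),
    # ... in the study, twenty ratings per idea across four hundred ideas
]

by_idea = defaultdict(list)
pool_of = {}
for idea_id, pool, rating in ratings:
    by_idea[idea_id].append(rating)
    pool_of[idea_id] = pool

# Average quality per idea, then average of those averages per pool.
idea_quality = {idea_id: mean(r) for idea_id, r in by_idea.items()}
by_pool = defaultdict(list)
for idea_id, quality in idea_quality.items():
    by_pool[pool_of[idea_id]].append(quality)

for pool, qualities in sorted(by_pool.items()):
    print(f"{pool}: mean quality {mean(qualities):.2f} over {len(qualities)} ideas")
```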
Source: LLM Working Paper, pg. 7
While the study had many limitations, it offers important lessons in innovation management and in building the human/machine synergy that will come to define a more productive and dynamic workforce. Until recently, creative tasks were the exclusive domain of human beings, and apprehensions about AI centered on the opacity of its reasoning processes. Understanding the reasoning of these models serves two main purposes: first, knowing why models generate the ideas they produce; and second, gaining transparency into the underlying generators, which will lead to more sophisticated, higher-quality, and more expansive models (or, in the case of GPT-4, understanding why it may be getting “dumber” as more inputs are involved)[8]. As successive LLMs improve through greater access to public/private data and data from third-party providers, AI feedback, and reinforcement learning, multiple independent studies have shown that LLMs have crossed a crucial threshold in solving complex scientific and mathematical problems.
Corporations will now be able to use LLMs for tasks as varied as cyber risk mitigation, HR practices, user support queries, organizational resilience and corporate strategy, workflow automation, and embedded software applications for streamlined business operations. While many initially feared AI’s effect on highly automated, procedurally simple “low-skill” jobs, recent LLMs have thrown that paradigm into doubt, putting white-collar jobs under threat as well. The advanced logical reasoning and creativity of LLMs will have profound implications for realistic, knowledge-intensive tasks.
In the short to medium term, however, current AI capabilities will create a “jagged technological frontier” in which some tasks are done by AI while the rest are performed by humans[9]. A recent Harvard Business School (HBS) study, conducted in collaboration with Boston Consulting Group (BCG), found that the best strategy moving forward is a synergistic, symbiotic machine-human collaboration in which humans upskill by learning and adopting AI platforms such as ChatGPT. Studying roughly 8% of BCG’s global workforce, the researchers found that consultants who used ChatGPT-4 on a set of realistic consulting tasks produced work of 40% higher quality, completed 12.2% more tasks, and finished them 25.1% faster.[10] Given these figures, businesses will be hard pressed not to adopt such practices, as competition will invariably force a race toward greater efficiency and productivity across industries and domains.
Corporations will soon have access to their own internal LLMs, much as intranets exist alongside the larger internet. Mitigating the risks of such internal models will depend on engineers planning proper evaluations, selecting and curating training datasets intentionally, engaging high-level stakeholders in the creation, upkeep, and maintenance of LLMs, and having an AI governance framework in place for exigent circumstances and worst-case scenarios. A minimal sketch of such an evaluation gate follows.
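The sketch below shows one plausible shape for the pre-deployment evaluation the paragraph describes: a fixed set of prompts with expected properties is scored on every new model version, and rollout is blocked if quality falls below a threshold. The eval cases, the containment check, the 90% threshold, and the `call_internal_llm` stub are all assumptions standing in for whatever inference endpoint and grading rubric a company actually uses.

```python
# A minimal pre-deployment evaluation gate for a hypothetical internal LLM.
EVAL_SET = [
    {"prompt": "Summarize our PTO policy.", "must_contain": "days"},
    {"prompt": "What is our password rotation rule?", "must_contain": "90"},
]
PASS_THRESHOLD = 0.9  # assumed minimum pass rate before rollout

def call_internal_llm(prompt: str) -> str:
    # Placeholder: replace with the company's actual inference endpoint.
    return "Employees receive 20 PTO days; passwords rotate every 90 days."

def run_evaluation() -> float:
    """Return the fraction of eval cases the model answers acceptably."""
    passed = 0
    for case in EVAL_SET:
        answer = call_internal_llm(case["prompt"])
        # A crude containment check; real harnesses use graded rubrics,
        # model-based judges, or human review.
        if case["must_contain"].lower() in answer.lower():
            passed += 1
    return passed / len(EVAL_SET)

if __name__ == "__main__":
    pass_rate = run_evaluation()
    print(f"pass rate: {pass_rate:.0%}")
    if pass_rate < PASS_THRESHOLD:
        raise SystemExit("evaluation failed: do not deploy this model version")
```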
Incorporating LLMs into bona fide use cases will yield more methodical assessments in enterprise risk, financial analysis and planning, fraud detection, innovation, worker productivity, and idea generation. Retooling, upskilling, and adopting dynamic AI platforms such as LLMs will ensure that businesses, executives, and workers stay on the cutting edge of implementation practice as AI continues to reshape how businesses engage with their customers and how society adapts to ever-present disruptive technologies.