A Beginner's Guide To Large Language Models
Contributors:
Annamalai Chockalingam
Ankur Patel
Shashank Verma
Tiffany Yeung
Preface
Regardless of when it first appeared, language remains the cornerstone of human communication. It
has taken on an even greater role in today’s digital age, where an unprecedented portion of the
population can communicate via both text and speech across the globe.
This is underscored by the fact that 347.3 billion email messages are sent and received worldwide
every day, and that five billion people – or over 63% of the entire world population – send and receive
text messages.
Language has therefore become a vast trove of information that can help enterprises extract valuable
insights, identify trends, and make informed decisions. As an example, enterprises can analyze texts
like customer reviews to identify their products’ best-selling features and fine-tune their future
product development.
However, both language analysis and production are time-consuming processes that can distract
employees and decision-makers from more important tasks. For instance, leaders often need to sift
through vast amounts of text to make informed decisions, rather than basing those decisions on key
information extracted for them.
Enterprises can minimize these and other problems, such as the risk of human error, by employing
large language models (LLMs) for language-related tasks. LLMs can help enterprises accelerate and
largely automate their efforts related to both language production and analysis, saving valuable time
and resources while improving accuracy and efficiency.
Unlike previous solutions, such as rule-based systems, LLMs are incredibly versatile and can be easily
adapted to a wide range of language-related tasks, like generating content or summarizing legal
documentation.
> Part 1 defines LLMs and outlines the technological and methodological advancements over the
years that made them possible. It also tackles more practical topics, such as how enterprises can
develop their own LLMs and the most notable companies in the LLM field. This should help
enterprises understand how adopting LLMs can unlock cutting-edge possibilities and revolutionize
their operations.
> Part 2 discusses five major use cases of LLMs within enterprises, including content generation,
summarization, and chatbot support. Each use case is exemplified with real-life apps and case
studies, so as to show how LLMs can solve real problems and help enterprises achieve specific
objectives.
> Part 3 is a practical guide for enterprises that want to build, train, and deploy their own LLMs. It
provides an overview of necessary prerequisites and possible trade-offs of different
development and deployment methods. ML engineers and data scientists can use this as a
reference throughout their LLM development processes.
Hopefully, this guide will inspire enterprises that have not yet adopted or developed their own LLMs to
do so soon in order to gain a competitive advantage and offer new state-of-the-art (SOTA) services or
products. As usual, the greatest benefits will be reserved for early adopters and truly visionary
innovators.
Terms and Descriptions

Deep learning systems: Systems that rely on neural networks with many hidden layers to learn
complex patterns.

Generative AI: AI programs that can generate new content, like text, images, and audio, rather than
just analyze it.

Large language models (LLMs): Language models that recognize, summarize, translate, predict, and
generate text and other content. They’re called large because they are trained on large amounts of
data and have many parameters, with popular LLMs reaching hundreds of billions of parameters.

Natural language processing (NLP): The ability of a computer program to understand and generate
text in natural language.

Long short-term memory neural network (LSTM): A special type of RNN with more complex cell
blocks that allow it to retain more past inputs.

Natural language generation (NLG): The part of NLP that refers to the ability of a computer program
to generate human-like text.

Natural language understanding (NLU): The part of NLP that refers to the ability of a computer
program to understand human-like text.

Neural network (NN): A machine learning algorithm in which the parameters are organized into
consecutive layers. The learning process of NNs is inspired by the human brain. Much like humans,
NNs “learn” important features via representation learning and require less human involvement than
most other approaches to machine learning.

Perception AI: AI programs that can process and analyze but not generate data, mainly developed
before 2020.

Recurrent neural network (RNN): A neural network that processes data sequentially and can
memorize past inputs.

Traditional machine learning: A statistical approach that draws probability distributions of words or
other tokens from a large annotated corpus. It relies less on rules and more on data.

Structured data: Data that is quantitative in nature, such as phone numbers, and can be easily
standardized and adjusted to a pre-defined format that ML algorithms can quickly process.

Unstructured data: Data that is qualitative in nature, such as customer reviews, and difficult to
standardize. Such data is stored in its native formats, like PDF files, before use.

Parameter-efficient techniques (PEFT): Techniques like prompt learning, LoRA, and adapter tuning
that allow researchers to customize pretrained language models (PLMs) for downstream tasks or
datasets while preserving and leveraging the existing knowledge of PLMs. These techniques are used
during model customization and allow for quicker training and often more accurate predictions.

Prompt learning: An umbrella term for two PEFT techniques, prompt tuning and p-tuning, which help
customize models by inserting virtual token embeddings among discrete or real token embeddings.

Open-domain question answering: Answering questions from a variety of different domains, like
legal, medical, and financial, instead of just one domain.

Extractive question answering: Answering questions by extracting the answers from existing texts or
databases.

Data readiness: The suitability of data for use in training, based on factors such as data quantity,
structure, and quality.
Large language models have unlocked unprecedented possibilities in the fields of NLP and AI. This
was most notably demonstrated by the release of OpenAI’s GPT-3 in 2020, at the time the largest
language model ever developed.
These models are designed to understand the context and meaning of text and can generate text that
is grammatically correct and semantically relevant. They can be trained on a wide range of tasks,
including language translation, summarization, question answering, and text completion.
GPT-3 made it evident that large-scale models can accurately perform a wide – and previously
unheard-of – range of NLP tasks, from text summarization to text generation. It also showed that
LLMs could generate outputs that are nearly indistinguishable from human-created text, all while
learning on their own with minimal human intervention.
This represented an enormous improvement over earlier, mainly rule-based models that could neither
learn on their own nor successfully solve tasks they weren’t trained on. It is no surprise, then, that
many other enterprises and startups soon started developing their own LLMs or adopting existing
LLMs in order to accelerate their operations, reduce expenses, and streamline workflows.
Part 1 is intended to provide a solid introduction and foundation for any enterprise that is considering
building or adopting its own LLM.
LLMs are also a subset of a more general technology called language models. All language models
have one thing in common: they can process and generate text that sounds like natural language.
This is known as performing tasks related to natural language processing (NLP).
Two characteristics set large language models apart:
1. They are trained on very large amounts of data.
2. They comprise a huge number of learnable parameters (i.e., representations of the underlying
structure of training data that help models perform tasks on new or never-before-seen data).
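To make the idea of a language model concrete, here is a deliberately tiny, pre-neural sketch: a bigram model that learns next-word statistics from a handful of sentences. The function names and toy corpus are illustrative only; real LLMs learn far richer patterns with neural networks rather than raw counts.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows another (a toy language model)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev_word, next_word in zip(tokens, tokens[1:]):
            counts[prev_word][next_word] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent next word seen in training, or None."""
    followers = counts.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

corpus = [
    "language models predict the next word",
    "language models generate text",
    "large language models generate fluent text",
]
model = train_bigram(corpus)
print(predict_next(model, "language"))  # "models"
print(predict_next(model, "models"))    # "generate" (seen twice vs. "predict" once)
```

Even this toy version shows the core idea: the model predicts what comes next based on patterns in its training data, and it fails on words it has never seen.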
Table 1 showcases two large language models, MT-NLG and GPT-3 Davinci, to help clarify what’s
considered large by contemporary standards.
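As a back-of-the-envelope illustration of what "large" means, the parameter count of a decoder-only transformer can be estimated from its depth and width alone. The formula below is a standard rough approximation (about 12·d² weights per layer plus token embeddings), and the GPT-3 figures used (96 layers, hidden size 12,288, roughly 50K-token vocabulary) are its publicly reported architecture; treat the result as an estimate, not an exact count.

```python
def approx_transformer_params(n_layers, d_model, vocab_size):
    """Rough parameter estimate for a decoder-only transformer.

    Each layer holds ~4*d^2 attention weights plus ~8*d^2 feed-forward
    weights (two matrices with a 4x expansion), i.e. ~12*d^2 per layer.
    """
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# GPT-3's reported architecture: 96 layers, hidden size 12288
estimate = approx_transformer_params(96, 12288, 50257)
print(f"~{estimate / 1e9:.0f}B parameters")  # ~175B
```

The estimate lands close to GPT-3's reported 175 billion parameters, which shows where the "hundreds of billions" figure comes from: it is almost entirely depth times width squared.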
Since the quality of a model heavily depends on the model size and the size of training data, larger
language models typically generate more accurate and sophisticated responses than their smaller
counterparts.
However, the performance of large language models doesn’t just depend on the model size or data
quantity. Quality of the data matters, too.
For example, LLMs trained on peer-reviewed research papers or published novels will usually perform
better than LLMs trained on social media posts, blog comments, or other unreviewed content. Low-
quality data like user-generated content may lead to all sorts of problems, such as models picking up
slang, learning incorrect spellings of words, and so on.
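In practice, data-curation pipelines screen out low-quality text with simple heuristics before training. The rules and thresholds below are illustrative placeholders, not values from any production pipeline:

```python
def passes_quality_filter(text, min_words=5, max_symbol_ratio=0.1):
    """Toy heuristic filter: reject very short or symbol-heavy snippets."""
    words = text.split()
    if len(words) < min_words:
        return False  # likely a fragment, tag, or spam
    allowed_punct = ".,!?;:'\"()-"
    n_symbols = sum(
        1 for ch in text
        if not (ch.isalnum() or ch.isspace() or ch in allowed_punct)
    )
    return n_symbols / len(text) <= max_symbol_ratio

docs = [
    "The study was peer reviewed and published in a reputable journal.",
    "lol",                       # too short
    "#### >>> @@@ $$$ %%% ^^^",  # symbol-heavy
]
print([passes_quality_filter(d) for d in docs])  # [True, False, False]
```

Real pipelines layer many such filters (deduplication, language identification, toxicity screening) on top of heuristics like these, but the principle is the same: keep the clean prose, drop the noise.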
In addition, models need very diverse data in order to perform various NLP tasks. However, if the
model is intended to be especially good at solving a particular set of tasks, it can be fine-tuned using a
more relevant and narrower dataset. By doing so, a foundation language model that is good at
performing various NLP tasks across a broad set of domains is transformed into a fine-tuned model
that specializes in performing tasks in a narrowly scoped domain.
Thanks to their size, foundation models can perform well even when they have little domain-specific
data at their disposal. They have good general performance across tasks but may not excel at
performing any one specific task.
Fine-tuned language models, on the other hand, are large language models derived from foundation
LLMs. They’re customized for specific use cases or domains and, thus, become better at performing
more specialized tasks.
Apart from the fact that fine-tuned models can perform specific tasks better than foundation models,
their biggest strength is that they are lighter and, generally, easier to train. But how does one actually
fine-tune a foundation model for specific objectives?
Currently, the most popular method is customizing a model using parameter-efficient customization
techniques, such as p-tuning, prompt tuning, adapters, and so on. Customization is far less time-
consuming and expensive than fine-tuning the entire model, although it may lead to somewhat
poorer performance than other methods. Customization methods are further discussed in Part 3.
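To see why parameter-efficient customization is so much cheaper, consider prompt tuning, where only a small set of virtual-token embeddings is trained while all of the foundation model's weights stay frozen. The numbers below (20 virtual tokens, a GPT-3-scale hidden size and total parameter count) are purely illustrative:

```python
def prompt_tuning_trainable_fraction(n_virtual_tokens, d_model, total_params):
    """Fraction of parameters actually trained under prompt tuning."""
    trainable = n_virtual_tokens * d_model  # one embedding vector per virtual token
    return trainable / total_params

# Illustrative GPT-3-scale numbers: hidden size 12288, 175B total parameters
frac = prompt_tuning_trainable_fraction(20, 12288, 175e9)
print(f"{frac:.8%} of the model's parameters are trained")  # ~0.00014%
```

Training a few hundred thousand embedding values instead of hundreds of billions of weights is what makes customization so much faster and cheaper than full fine-tuning, at the cost of the somewhat lower ceiling on task performance noted above.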
The advent of large language models further fueled a revolutionary paradigm shift in the way NLP
models are designed, trained, and used. To truly understand this, it may be helpful to compare large
language models to previous NLP models and how they worked. For this purpose, let’s briefly explore
three regimes in the history of NLP: pre-transformers NLP, transformers NLP, and LLM NLP.
1. Pre-transformers NLP was mainly marked by models that relied on human-crafted rules rather
than machine learning algorithms to perform NLP tasks. This made them suitable for simpler tasks
that didn’t require too many rules, like text classification, but unsuitable for more complex tasks,
such as machine translation. Rule-based models also performed poorly in edge-case scenarios
because they couldn’t make accurate predictions or classifications for never-before-seen data for
which no clear rules were set. This problem was somewhat solved with simple neural networks,
such as RNNs and LSTMs, developed during the later phases of this period. RNNs and LSTMs could
memorize past data to a certain extent and, thus, provide context-dependent predictions and