Programming Large Language Models with Azure Open AI: Conversational programming and prompt engineering with LLMs

Francesco Esposito

Published with the authorization of Microsoft Corporation by:
Pearson Education, Inc.

Copyright © 2024 by Francesco Esposito.

All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearson.com/permissions.

No patent liability is assumed with respect to the use of the information contained herein. Although every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions. Nor is any liability assumed for damages resulting from the use of the information contained herein.

ISBN-13: 978-0-13-828037-6

Editor-in-Chief: Brett Bartow
Executive Editor: Loretta Yates
Associate Editor: Shourav Bose
Development Editor: Kate Shoup
Managing Editor: Sandra Schroeder
Senior Project Editor: Tracey Croom
Copy Editor: Dan Foster
Indexer
A I.
Because not dedicating a book to you would have been a sacrilege.
Contents at a Glance

Introduction    xiii
Index    234

Contents

Acknowledgments    xi
Introduction    xiii

Summary    23

Basic techniques    34
    Zero-shot scenarios    35
    Few-shot scenarios    37
    Chain-of-thought scenarios    41
LLM limitations    48
Summary    49

Chapter 3  Engineering advanced learning prompts    51

What's beyond prompt engineering?    51
    Combining pieces    52
    Fine-tuning    54
Function calling    56
    Homemade-style    57
    OpenAI-style    60
Summary    76

LangChain    88
    Models, prompt templates, and chains    88
    Agents    96
    Data connection    104
Summary    129

Chapter 5  Security, privacy, and accuracy concerns    131

Overview    131
    Responsible AI    131
    Red teaming    132
    Abuse and content filtering    133
    Hallucination and performances    134
    Bias and fairness    135
Summary    157

Summary    179

Main UI features    183
Pros and cons in production    185
Summary    201

Summary    216
Acknowledgments
In the spring of 2023, when I told my dad how cool Azure OpenAI was becoming, his
reply was kind of a shock: “Why don’t you write a book about it?” He said it so naturally
that it hit me as if he really thought I could do it. In fact, he added, “Are you up for it?”
Then there was no need to say more. Loretta Yates at Microsoft Press enthusiastically
accepted my proposal, and the story of this book began in June 2023.
AI has been a hot topic for the better part of a decade, but the emergence of new-
generation large language models (LLMs) has propelled it into the mainstream. The
increasing number of people using them translates to more ideas, more opportunities,
and new developments. And this makes all the difference.
Hence, the book you hold in your hands can’t be the ultimate and definitive guide to
AI and LLMs because the speed at which AI and LLMs evolve is impressive and because—
by design—every book is an act of approximation, a snapshot of knowledge taken at a
specific moment in time. Approximation inevitably leads to some form of dissatisfaction,
and dissatisfaction leads us to take on new challenges. In this regard, I wish for myself
decades of dissatisfaction. And a few more years of being on the stage presenting books
written for a prestigious publisher—it does wonders for my ego.
First, I feel somewhat indebted to all my first dates since May because they had to
endure monologues lasting at least 30 minutes on LLMs and some weird new approach
to transformers.
True thanks are a private matter, but publicly I want to thank Martina first, who
cowrote the appendix with me and always knows what to say to make me better. My
gratitude to her is keeping a promise she knows. Thank you, Martina, for being an
extraordinary human being.
I also want to thank Matteo, Luciano, Gabriele, Filippo, Daniele, Riccardo, Marco,
Jacopo, Simone, Francesco, and Alessia, who worked with me and supported me
during my (hopefully not too frequent) crises. I also have warm thoughts for
Alessandro, Antonino, Sara, Andrea, and Cristian who tolerated me whenever we
weren’t like 25-year-old youngsters because I had to study and work on this book.
To Mom and Michela, who put up with me before the book and probably will
continue after. To my grandmas. To Giorgio, Gaetano, Vito, and Roberto for helping
me to grow every day. To Elio, who taught me how to dress and see myself in more
colors.
As for my dad, Dino, he never stops teaching me new things—for example, how
to get paid for doing things you would just love to do, like being the technical editor
of this book. Thank you, both as a father and as an editor. You bring to my mind a
song you well know: “Figlio, figlio, figlio.”
Beyond Loretta, if this book came to life, it was also because of the hard work of
Shourav, Kate, and Dan. Thank you for your patience and for trusting me so much.
Introduction
This is my third book on artificial intelligence (AI), and the first I wrote on my own,
without the collaboration of a coauthor. The sequence in which my three books
have been published reflects my own learning path, motivated by a genuine thirst to
understand AI for far more than mere business considerations. The first book, pub-
lished in 2020, introduced the mathematical concepts behind machine learning (ML)
that make it possible to classify data and make timely predictions. The second book,
which focused on the Microsoft ML.NET framework, was about concrete applica-
tions—in other words, how to make fancy algorithms work effectively on amounts of
data hiding their complexity behind the charts and tables of a familiar web front end.
And for decades, we did dream of these applications. Literature and movies
presented various supercomputers capable of crunching any sort of data to pro-
duce human-intelligible results. An extremely popular example was HAL 9000—the
computer that governed the spaceship Discovery in the movie 2001: A Space Odyssey
(1968). Another famous one was JARVIS (Just A Rather Very Intelligent System), the
computer that served Tony Stark’s home assistant in Iron Man and other movies in
the Marvel Comics universe.
Often, all that the human characters in such books and movies do is simply “load
data into the machine,” whether in the form of paper documents, digital files, or
media content. Next, the machine autonomously figures out the content, learns
from it, and communicates back to humans using natural language. But of course,
those supercomputers were conceived by authors; they were only science fiction.
Today, with LLMs, it is possible to devise and build concrete applications that not
only make human–computer interaction smooth and natural, but also turn the old
dream of simply “loading data into the machine” into a dazzling reality.
This book shows you how to build software applications using the same type of
engine that fuels ChatGPT to autonomously communicate with users and orches-
trate business tasks driven by plain textual prompts. No more, no less—and as easy
and striking as it sounds!
Assumptions
To fully grasp the value of a programming book on LLMs, there are a couple of
prerequisites, including proficiency in foundational programming concepts and a
familiarity with ML fundamentals. Beyond these, a working knowledge of relevant
programming languages and frameworks, such as Python and possibly ASP.NET
Core, is helpful, as is an appreciation for the significance of classic natural language
processing in the context of business domains. Overall, a blend of programming
expertise, ML awareness, and linguistic understanding is recommended for a
comprehensive grasp of the book’s content.
This book might not be for you if…
This book might not be for you if you’re just seeking a reference book to find out in
detail how to use a particular pattern or framework. Although the book discusses
advanced aspects of popular frameworks (for example, LangChain and Semantic
Kernel) and APIs (such as OpenAI and Azure OpenAI), it does not qualify as a pro-
gramming reference on any of these. The focus of the book is on using LLMs to build
useful applications in the business domains where LLMs really fit well.
How this book is organized

■■ The first three chapters delve into scenarios for which an LLM is effective and
introduce essential tools for crafting sophisticated solutions. These chapters
provide insights into conversational programming and prompting as a new,
advanced, yet structured, approach to coding.
■■ The next two chapters emphasize patterns, frameworks, and techniques for
unlocking the potential of conversational programming. This involves using
natural language in code to define workflows, with the LLM-based applica-
tion orchestrating existing APIs.
Downloads: notebooks and samples
Python and Polyglot notebooks containing the code featured in the initial part of
the book, as well as the complete codebases for the examples tackled in the latter
part of the book, can be accessed on GitHub at:
https://github.com/Youbiquitous/programming-llm
Errata, updates & book support

Updates and corrections for this book are listed at:

MicrosoftPressStore.com/LLMAzureAI/errata
If you discover an error that is not already listed, please submit it to us at the
same page.
Please note that product support for Microsoft software and hardware is not
offered through the previous addresses. For help with Microsoft software or hard-
ware, go to http://support.microsoft.com.
Stay in touch
Let’s keep the conversation going! We’re on X / Twitter: http://twitter.com/MicrosoftPress.
CHAPTER 2
Prompt learning techniques play a crucial role in so-called “conversational programming,” the new
paradigm of AI and software development that is now taking off. These techniques involve the
strategic design of prompts, which are then used to draw out desired responses from large language
models (LLMs).
Prompt engineering is the creative sum of all these techniques. It provides developers with the tools
to guide, customize, and optimize the behavior of language models in conversational programming
scenarios. Resulting prompts are in fact instrumental in guiding and tailoring responses to business
needs, improving language understanding, and managing context.
Prompts are not magic, though. Quite the reverse: getting them right is more a matter of trial
and error than pure wizardry. Hence, at some point, you may end up with prompts that only partially
address very specific domain requests. This is where the need for fine-tuning emerges.
Prompts at a glance
Let’s try some prompts with a particular LLM—specifically, GPT-3.5-turbo. Be aware, though, that LLMs
are not deterministic tools, meaning that the response they give for the same input may be different
every time.
Note Although LLMs are commonly described as non-deterministic, “seed” mode is now
becoming more popular—in other words, seeding the model instead of sampling for a fully
reproducible output.
A very basic prompt

The hello-world of prompt engineering—easily testable online on Bing Chat, ChatGPT, or something
similar—can be as simple as a short, unfinished sentence that the model is left to complete.

Overall, the answer the model gives makes sense: it tries to provide a continuation of the string,
given the understood context.
Pushed a little further, for example by asking for the same kind of answer in Shakespearean
English, the model replies along these lines:

During the week, I doth engage in myriad tasks and endeavors, as the sun traverseth the sky with
unwavering pace.
So far so good.
Moving on to something more concrete, you can pass the model a passage of text and ask it to
extract structured information from it:

'Unacceptable risk AI systems are systems considered a threat to people and will be banned. They
include:
-Cognitive behavioral manipulation of people or specific vulnerable groups: for example,
voice-activated toys that encourage dangerous behavior in children
-Social scoring: classifying people based on behavior, socio-economic status, or personal
characteristics
-Real-time and remote biometric identification systems, such as facial recognition
Some exceptions may be allowed: For instance, "post" remote biometric identification systems
where identification occurs after a significant delay will be allowed to prosecute serious
crimes but only after court approval.'
Given the above, extract only the forbidden AI applications and output them as json.

The model returns something along these lines:

{
"Forbidden AI Applications":[
{
"Application":"Cognitive behavioral manipulation of people or specific vulnerable groups",
"Example": "Voice-activated toys that encourage dangerous behavior in children"
},
{
"Application":"Social scoring",
"Example":"Classifying on behavior, socio-economic status or personal characteristics"
},
{
"Application":"Real-time and remote biometric identification systems",
"Example":"Facial recognition"
}
]
}
Encouraged by these first experiments, let’s try to outline some general rules for prompts.
Designing a prompt is an iterative process. Not surprisingly, the first reply you get from a model
might be quite unreasonable. Don’t give up; just try again, but be more precise in what you provide,
whether it’s plain instructions, input data, or context.
Two key points for a good prompt are specificity and descriptiveness.
■■ Specificity means designing prompts to leave as little room for interpretation as possible. By
providing explicit instructions and restricting the operational space, developers can guide the
language model to generate more accurate and desired outputs.
■■ Descriptiveness plays a significant role in effective prompt engineering. By using analogies and
vivid descriptions, developers can provide clear instructions to the model. Analogies serve as
valuable tools for conveying complex tasks and concepts, enabling the model to grasp the
desired output with improved context and understanding.
For instance, a specific prompt that clearly delimits its input might look like this:

Extract sentiment from the following text delimited by triple backticks: '''Great choice!'''
When the first attempt fails, two simple design strategies might help:
■■ Doubling down on instructions is useful to reinforce clarity and consistency in the model’s
responses. Repetition techniques, such as providing instructions both before and after the
primary content or using instruction-cue combinations, strengthen the model’s understanding
of the task at hand.
■■ Changing the order of the information presented to the model. Whether instructions precede
the content (summarize the following) or follow it (summarize the preceding) can lead to
different results.
One last thing to consider is an exit strategy for the model in case it fails to respond adequately.
The prompt should instruct the model with an alternative path—in other words, an out. For instance,
when asking a question about some documents, including a directive such as write 'not found'
if you can't find the answer within the document or check if the conditions are
satisfied before answering allows the model to gracefully handle situations in which the desired
information is unavailable. This helps to avoid the generation of false or inaccurate responses.
Beyond the purely prompt-based conversational approach, there are a few screws to tighten—
comparable to the old-but-gold hyperparameters in the classic machine learning approach. These
include the number of tokens, temperature, top_p (or nucleus) sampling, frequency penalties, presence
penalties, and stop sequences.
Temperature affects the probability distribution of potential tokens at each step of the generation
process. In practice, when choosing the next token, a model with a temperature of 0 will always choose
the most probable one, while a model with a higher temperature will choose a token more or less
randomly. A temperature of 0, therefore, makes the model's choices essentially deterministic.
Note As discussed in Chapter 1, the temperature parameter works on the LLM’s last layer,
being a parameter of the softmax function.
An alternative technique called top_p sampling (or nucleus sampling) is also useful for altering the
default behavior of the LLM when generating the next token. With top_p sampling, instead of con-
sidering all possible tokens, the LLM focuses only on a subset of tokens (known as the nucleus) whose
cumulative probability mass adds up to a specified threshold called top_p.
With top_p, the range of acceptable values is from 0 to 1. If, for instance, the top_p value is set to
0.3, the language model will consider only tokens that contribute to the top 30% of the probability
mass for the next token. This means lower top_p values limit the vocabulary of the model.
Adjusting these parameters makes them versatile for a wide range of applications. For instance, for a
creative-writing task, a temperature of 1.3 (while also playing around with top_p) could be good, while
for a code-generation task, 0.4 (again, while also experimenting with top_p) would be better.
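If you are experimenting through the Azure.AI.OpenAI .NET SDK (which is set up later in this chapter), these recommendations map directly onto the options object passed with each request. The following is a minimal sketch, not code from the book's samples; endpoint, key, and deployment names are placeholder environment variables:

using System;
using Azure;
using Azure.AI.OpenAI;

// Placeholder environment variables; adjust to your own resource.
var client = new OpenAIClient(
    new Uri(Environment.GetEnvironmentVariable("AOAI_ENDPOINT")),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AOAI_KEY")));
var deployment = Environment.GetEnvironmentVariable("AOAI_chat_DEPLOYMENTID");

// Creative writing: a high temperature (and, optionally, a slightly reduced top_p).
var creativeOptions = new ChatCompletionsOptions
{
    DeploymentName = deployment,
    Temperature = 1.3f,
    NucleusSamplingFactor = 0.9f,   // top_p
    MaxTokens = 500
};

// Code generation: a lower temperature for more predictable output.
var codeGenOptions = new ChatCompletionsOptions
{
    DeploymentName = deployment,
    Temperature = 0.4f,
    NucleusSamplingFactor = 1f,
    MaxTokens = 500
};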
Consider a prompt that asks the model to rephrase a short passage about aligning an LLM's output
with the desired outcome (the same passage used in the code examples later in this chapter). With a
temperature of 2, the prompt returns something like the following (which is grammatically
incorrect):
When attempting to have a large language model (LLM) output that reflects the desired outcome,
there are numerous alternatives to ponder. adapting the prompt itself is called one approach
that credits sustainability. Grounding on hyperparameters programs appertains settings amidst
the indicators simultaneously participates confidentiality impact later goals aren abbreviated
instructions. Otherwise instructions brings alternative ramification which doesn omit lengthy
resolution reaching goals which aren susceptible to evaluations constantly... likelihood
advancement adjustment exemplark.
At the other end, with a low temperature, the output is well-formed and predictable:

To achieve better alignment between the output of a language model (LLM) and the desired
outcome, there are various options to explore. One option is to adjust the prompt, while another
involves tweaking the model's hyperparameters.
In the middle, with a temperature of 0.8, the result looks like this:
To achieve a greater degree of alignment between the output of a language model (LLM) and the
intended result, there are various approaches to consider. One tactic involves adjusting the
prompt directly, whereas an alternative involves manipulating the model's hyperparameters.
The frequency penalty is applied to tokens that have already been mentioned in the preceding text
(including the prompt). It is scaled based on the number of times the token has appeared. For example,
a token that has appeared five times receives a higher penalty, reducing its likelihood of appearing
again, than a token that has appeared only once. The presence penalty, on the other hand, applies a
penalty to tokens regardless of their frequency. Once a token has appeared at least once before, it will
be subject to the penalty. The range of acceptable values for both is from –2 to 2.
Applied to the same rephrasing prompt, high penalty values produce variations such as the
following two outputs:

To enhance the accuracy of a large language model's (LLM) output to meet the desired result,
there are various strategies to explore. One method involves adjusting the prompt itself,
whereas another entails manipulating the model's hyperparameters.
There are various options to consider when attempting to better align the output of a language
model (LLM) with the desired outcome. One option is to modify the prompt, while another is to
adjust the model's hyperparameters.
Note The model is limited by its inner structure. For instance, GPT-4 accepts at most 8,192
tokens (32,768 for the GPT-4-32k variant), including the entire conversation and prompts, while
GPT-4-turbo has a context window of 128K tokens.
Consider, for example, a prompt that ends with an unfinished sentence such as Paris is the capital
of. The model will likely generate France. If a full stop (.) is designated as the stop sequence, the
model will cease generating text when it reaches the end of the first sentence, regardless of the
specified token limit.
A more complex example can be built with a few-shot approach in which each sentiment label is
wrapped in a pair of angled brackets (<< … >>). Designating >> as the stop sequence then makes the
model stop as soon as it has emitted a single label.
By using stop sequences strategically within prompts, developers can ensure that the model gener-
ates text up to a specific point, preventing it from producing unnecessary or undesired information.
This technique proves particularly useful in scenarios where precise and limited-length responses are
desired, such as when generating short summaries or single-sentence outputs.
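In code, a stop sequence is just one more setting on the same options object. Here is a minimal sketch, again assuming the placeholder environment variables used elsewhere in this chapter; with "." as the stop sequence, the model halts at the end of the first sentence:

using System;
using Azure;
using Azure.AI.OpenAI;

var client = new OpenAIClient(
    new Uri(Environment.GetEnvironmentVariable("AOAI_ENDPOINT")),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AOAI_KEY")));

var options = new ChatCompletionsOptions
{
    DeploymentName = Environment.GetEnvironmentVariable("AOAI_chat_DEPLOYMENTID"),
    MaxTokens = 100,
    Temperature = 0f,
    // Generation stops as soon as a full stop is produced, regardless of MaxTokens.
    StopSequences = { "." }
};
options.Messages.Add(new ChatRequestUserMessage("Complete this sentence: Paris is the capital of"));

var response = client.GetChatCompletions(options);
Console.WriteLine(response.Value.Choices[0].Message.Content);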
In this section, you’ll focus on OpenAI models, like GPT-4, GPT-3.5-turbo, and their predecessors.
(Other chapters might use different models.) For these examples, .NET and C# will be used mainly, but
Python will also be used at some point.
In general, the Azure OpenAI Service enables Azure customers to use those advanced language AI
models, while still benefiting from the security and enterprise features offered by Microsoft Azure, such
as private networking, regional availability, and responsible AI content filtering.
At first, directly accessing OpenAI could be the easiest choice. However, when it comes to enterprise
implementations, Azure OpenAI is the more suitable option due to its alignment with the Azure
platform and its enterprise-grade features.
To get started with Azure OpenAI, your Azure subscription must include access to Azure OpenAI,
and you must set up an Azure OpenAI Service resource with a deployed model.
If you choose to use OpenAI directly, you can create an API key on the developer site
(https://platform.openai.com/).
In terms of technical differences, OpenAI uses the model keyword argument to specify the desired
model, whereas Azure OpenAI employs the deployment_id keyword argument to identify the specific
model deployment to use.
■■ The Chat Completion API supports GPT-4-turbo, GPT-4, GPT-4-0314, GPT-4-32k, GPT-4-32k-0314,
GPT-3.5-turbo, and GPT-3.5-turbo-0301.
■■ The Completion API includes older (but still good for some use cases) models, such as
text-davinci-003, text-davinci-002, text-curie-001, text-babbage-001, and text-ada-001.
One advantage of the Chat Completion API is the role selection feature, which enables users to
assign roles to different entities in the conversation, such as user, assistant, and, most importantly,
system. The first system message provides the model with the main context and instructions “set in
stone.” This helps in maintaining consistent context throughout the interaction. Moreover, the system
message helps set the behavior of the assistant. For example, you can modify the personality or tone of
the assistant or give specific instructions on how it should respond. Additionally, the Chat Completion
API allows for longer conversational context to be appended, enabling a more dynamic conversation
flow. In contrast, the Completion API does not include the role selection or conversation formatting
features. It takes a single prompt as input and generates a response accordingly.
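In the .NET SDK used later in this chapter, the roles surface as distinct message types added to the Chat Completion request. The following is a minimal sketch (names and message text are illustrative) that primes a system message and a short prior exchange before the new user turn:

using System;
using Azure;
using Azure.AI.OpenAI;

var client = new OpenAIClient(
    new Uri(Environment.GetEnvironmentVariable("AOAI_ENDPOINT")),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AOAI_KEY")));

var options = new ChatCompletionsOptions
{
    DeploymentName = Environment.GetEnvironmentVariable("AOAI_chat_DEPLOYMENTID"),
    Temperature = 0.7f,
    MaxTokens = 300
};

// The system message fixes context, tone, and rules "set in stone."
options.Messages.Add(new ChatRequestSystemMessage(
    "You are a concise assistant. Answer in at most two sentences."));
// Prior turns can be appended to keep the conversational context.
options.Messages.Add(new ChatRequestUserMessage("What is a prompt?"));
options.Messages.Add(new ChatRequestAssistantMessage(
    "A prompt is the text you send to a language model to steer its output."));
// The new user turn.
options.Messages.Add(new ChatRequestUserMessage("And what is a system message?"));

var response = client.GetChatCompletions(options);
Console.WriteLine(response.Value.Choices[0].Message.Content);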
Both APIs provide finish_reasons in the response to indicate the completion status. Possible
finish_reasons values include stop (complete message or a message terminated by a stop
sequence), length (incomplete output due to token limits), function_call (model calling a func-
tion), content_filter (omitted content due to content filters), and null (response still in progress).
Although OpenAI recommends the Chat Completion API for most use cases, the raw Completion
API sometimes offers more potential for creative structuring of requests, allowing users to construct
their own JSON format or other formats. The JSON output can be forced in the Chat Completion API by
using the JSON mode with the response_format parameter set to json_object.
To summarize, the Chat Completion API is a higher-level API that generates an internal prompt
and calls some lower-level API and is suited for chat-like interactions with role selection and conversa-
tion formatting. In contrast, the Completion API is focused on generating completions for individual
prompts.
It’s worth mentioning that the two APIs are to some extent interchangeable. That is, a user can force
the format of a Chat Completion response to reflect the format of a Completion response by construct-
ing a request using a single user message. For instance, one can translate from English to Italian with
the following Completion prompt:
[{"role": "user", "content": 'Translate the following English text to Italian: "{input}"'}]
Setting things up in C#
You can now set things up to use Azure OpenAI API in Visual Studio Code through interactive .NET
notebooks, which you will find in the source code that comes with this book. The model used is
GPT-3.5-turbo. You set up the necessary NuGet package—in this case, Azure.AI.OpenAI—with the
following line (the version shown here is only an example; use the latest available prerelease):

#r "nuget: Azure.AI.OpenAI, 1.0.0-beta.9"
using System;
using Azure.AI.OpenAI;
var AOAI_ENDPOINT = Environment.GetEnvironmentVariable("AOAI_ENDPOINT");
var AOAI_KEY = Environment.GetEnvironmentVariable("AOAI_KEY");
var AOAI_DEPLOYMENTID = Environment.GetEnvironmentVariable("AOAI_DEPLOYMENTID");
var AOAI_chat_DEPLOYMENTID = Environment.GetEnvironmentVariable("AOAI_chat_DEPLOYMENTID");
var endpoint = new Uri(AOAI_ENDPOINT);
var credentials = new Azure.AzureKeyCredential(AOAI_KEY);
var openAIClient = new OpenAIClient(endpoint, credentials);
var completionOptions = new ChatCompletionsOptions
{
DeploymentName=AOAI_DEPLOYMENTID,
MaxTokens=500,
Temperature=0.7f,
FrequencyPenalty=0f,
PresencePenalty=0f,
NucleusSamplingFactor=1,
StopSequences={}
};
var prompt =
@"rephrase the following text: <<<When aiming to align the output of a language model (LLM)
more closely with the desired outcome, there are several options to consider. One approach
involves modifying the prompt itself, while another involves working with hyperparameters of the
model>>>";
After running this code, one possible output displayed in the notebook is as follows:
There are various ways to bring the output of a language model (LLM) closer to the intended
result. One method is to adjust the prompt, while another involves tweaking the model's
hyperparameters.
Setting things up in Python

The same setup can be done in Python:

import os
import openai
from openai import AzureOpenAI
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2023-09-01-preview"
)
deployment_name=os.getenv("AOAI_DEPLOYMENTID")
context = [ {'role':'user', 'content':"rephrase the following text: 'When aiming to align the
output of a language model (LLM) more closely with the desired outcome, there are several
options to consider: one approach involves modifying the prompt itself, while another involves
working with hyperparameters of the model.'"} ]
response = client.chat.completions.create(
model=deployment_name,
messages=context,
temperature=0.7)
response.choices[0].message.content
This is based on OpenAI Python SDK v.1.6.0, which can be installed via pip install openai.
Basic techniques
Prompt engineering involves understanding the fundamental behavior of LLMs to construct prompts
effectively. Prompts consist of different components: instructions, primary content, examples, cues,
and supporting content (also known as additional context or knowledge). Instructions guide the model
on what to do, while primary content is the main text being processed. Examples provide desired
behavior demonstrations, while cues act as a jumpstart for the model’s output. Supporting content
is the additional context or knowledge the model can draw on to shape its answer.

This section covers basic techniques for mastering the art of prompting.
Zero-shot scenarios
Whenever a task, assigned to a model through a prompt, is given without any specific example of the
desired output, it’s called zero-shot prompting. Basic scenarios include simple, single-instruction
requests, such as the topic-extraction and sentiment examples that follow, where no demonstration
is needed.
Note Clear prompts might not be short. In many situations, longer prompts provide more
clarity and context.
A few examples
A basic example of a zero-shot prompt might look like this:
Determine at most three topics that are being discussed in the following text, delimited by
triple backticks.
Iterative refining
Prompt engineering is a matter of refining. Trying to improve the preceding result, you might want to
explicitly list the sentiment the model should output and to limit the output to the sentiment only. For
example, a slightly improved prompt might look like the following:
Extract sentiment (positive, neutral, negative, unknown) from the following text, delimited by
triple backticks.
'''Language models have revolutionized the way we interact with technology, empowering us to
generate creative content, explore new ideas, and enhance our communication. LLMs offer immense
potential for unlocking innovation and improving various aspects of our lives, opening up
exciting possibilities for the future.'''
Return only one word indicating the sentiment.
Positive
Likewise, regarding the topic extraction, you might want only one or two words per topic, each
separated by commas:
Determine at most three topics that are being discussed in the following text, delimited by
triple backticks.
Format the response as a list of at most 2 words, separated by commas.
'''Language models have revolutionized the way we interact with technology, empowering us to
generate creative content, explore new ideas, and enhance our communication. LLMs offer immense
potential for unlocking innovation and improving various aspects of our lives, opening up
exciting possibilities for the future.'''
A few-shot prompt consists of several examples, or shots, which condition the model to generate
responses in subsequent instances. While a single example may suffice for basic tasks, more challenging
scenarios call for increasing numbers of demonstrations.
When using the Chat Completion API, few-shot learning examples can be included in the system
message or, more often, in the messages array as user/assistant interactions following the initial system
message.
Note Few-shot prompting is useful if the accuracy of the response is too low. (Measuring
accuracy in an LLM context is covered later in the book.)
This approach focuses on priming the model for inference within specific conversations or contexts.
It has demonstrated competitive performance compared to fine-tuned models in tasks like translation,
question answering, word unscrambling, and sentence construction. However, the inner workings of
in-context learning and the contributions of different aspects of shots to task performance remain less
understood.
Recent research has shown that ground truth demonstrations are not essential, as randomly replac-
ing correct labels has minimal impact on classification and multiple-choice tasks. Instead, other aspects
of demonstrations, such as the label space, input text distribution, and sequence format, play crucial
roles in driving performance. For instance, two prompts for sentiment analysis—one with correct
labels and one with completely wrong labels—offer similar performance.
In-context learning may struggle with tasks that lack precaptured input-label correspondence. This
suggests that the intrinsic ability to perform a task is obtained during training, with demonstrations
(or shots) primarily serving as a task locator.
A few examples
One of the most famous examples of the efficiency of few-shot learning prompts is one taken from a
paper by Brown et al. (2020), where the task is to correctly use a new word in a sentence:
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses
the word whatpu is: We were traveling in Africa and we saw these very cute whatpus. To do a
"farduddle" means to jump up and down really fast. An example of a sentence that uses the word
farduddle is:
We were so excited that we couldn't help but farduddle when our favorite team won the
championship.
A very good use case for few-shot learning is writing something in a given style, such as code docu-
mentation or social media posts. Whenever there is a pattern or a format, and explaining it is more
complex than showing it, it’s worth trying few-shot prompting. For instance, the following prompt
would generate consistent documentation:
/// <summary>
/// Returns detailed information about a specific event
/// </summary>
/// <param name="org">Name of the organization</param>
/// <param name="year">Year of the event</param>
/// <param name="tournamentId">Internal ID of the event</param>
/// <returns>Detailed info about the event</returns>
/// <remarks>Sample events to try are: 2892, 2331, 1028. Try also 8392.
/// If any of the parameters are invalid, the endpoint returns 404.</remarks>
public IActionResult Tournament(string org, int? year, string eventId)
Based on the model version and the API (Completion or Chat Completion), the system might tend to
explain the code and the documentation instead of producing new documentation for the presented
method.

Iterative refining

The last prompt is likely to generate a response like the following:
/// <summary>
/// Returns detailed information about a specific event.
/// </summary>
/// <param name="org">The name of the organization.</param>
/// <param name="year">The year of the event.</param>
/// <param name="tournamentId">The internal ID of the event.</param>
/// <returns>Detailed information about the event.</returns>
/// <remarks>
/// Sample event IDs to try are: 2892, 2331, 1028. You can also try 8392.
/// If any of the parameters are invalid, the endpoint returns a 404 error.
/// </remarks>
public IActionResult Event(string org, int? year, string eventId) {
// Implementation code here
}
/// <summary>
/// Retrieves information about a specific team.
/// </summary>
/// <param name="org">The name of the organization.</param>
/// <param name="teamId">The ID of the team.</param>
/// <returns>Information about the team.</returns>
public IActionResult Team(string org, string teamId) {
// Implementation code here
}
The Event method is an action that returns detailed information about a specific event. It takes
three parameters: org (representing the name of the organization), year (representing the year
of the event, nullable), and eventId (representing the internal ID of the event). This method
should be called to retrieve detailed information about an event.
The Team method is another action that retrieves information about a specific team. It takes two
parameters: org (representing the name of the organization) and teamId (representing the ID of
the team). This method should be called to fetch information about a team.
As is easily observable from the previous examples, all these generative models want to write text.
This is reasonable because they were trained to predict the next token. It is very likely, then, that a good
prompt needs to incorporate something to restrict the output.
Your task is to generate code comments in a consistent style with the reference. The response
must include only the comments for the new method, without any additional explanation or
comment.
Style reference:
/// <summary>
/// Returns detailed information about a specific event
/// </summary>
/// <param name="org">Name of the organization</param>
/// <param name="year">Year of the event</param>
/// <param name="tournamentId">Internal ID of the event</param>
/// <returns>Detailed info about the event</returns>
/// <remarks>Sample events to try are: 2892, 2331, 1028. Try also 8392.
/// If any of the parameters are invalid, the endpoint returns 404.</remarks>
public IActionResult Tournament(string org, int? year, string eventId)
New method:
/// <summary>
/// Returns detailed information about a specific team.
/// </summary>
/// <param name="org">Name of the organization</param>
/// <param name="teamId">ID of the team</param>
/// <returns>Detailed info about the team</returns>
Please note that this is the code comment for the new Team method. It retrieves detailed
information about a specific team. The method takes two parameters: org, which represents the
name of the organization, and teamId, which is the ID of the team. The method returns detailed
information about the team.
At this point, to prevent the model from producing additional text, you might need a different
strategy: asking the model to check if certain conditions on the output are satisfied. This can be done
by appending this line to the former prompt:
Check if the output contains additional text and, if so, remove it.
Asking the model to check if certain conditions in the input are satisfied is a very useful technique. It
can also be exploited for more standard tasks, such as form or json/xml/html validation.
In this case, you also tried to validate the output text. This is more of a trick than a technique
because the model doesn’t really produce the full output to be validated. Still, it works as a guardrail.
A better way to achieve the same result would have been to add one more API call with the former
prompt or, as explored later in the book, to involve a framework like Microsoft Guidance or Guardrails AI.
Considering this, it’s important to stress that these models work better when they are told what they
need to do instead of what they must avoid.
Note Later models such as GPT-4 perform noticeably better on logical problems, even with
simple non-optimized prompts.
Chain-of-thought scenarios

When few-shot prompting proves insufficient, it may indicate the need for fine-tuning models (if
these are an option, which they aren’t for GPT-4 and GPT-4-turbo) or exploring advanced prompting
techniques. One such technique is chain-of-thought (CoT) prompting. You use CoT prompting to track
down all the steps (thoughts) performed by the model to draw the solution.
As presented in the work of Wei et al. (2022), this technique gives the model time to think, enhanc-
ing reasoning abilities by incorporating intermediate reasoning steps. When used in conjunction with
few-shot prompting, it leads to improved performance on intricate tasks that demand prior reasoning
for accurate responses.
Note The effectiveness of CoT prompting is observed primarily when employed with
models consisting of approximately 100 billion parameters. Smaller models tend to generate
incoherent chains of thought, resulting in lower accuracy compared to standard prompting.
The performance improvements achieved through CoT prompting generally scale with the
size of the model.
The basic idea behind CoT is to instruct the model to take time to think, allowing it to reconstruct
the reasoning steps during response generation. This simple idea significantly enhances the model’s
ability to perform complex tasks. This makes sense when you consider how language models produce
responses. They attempt to predict the next token, so the more intermediate “reasoning” tokens they
generate, the more the model needs to establish coherence among them and, ideally, provide a correct
response.
Note By only asking the model for the final answer, you leave limited room for the model to
verify the coherence between the question (prompt) and its response; in contrast, explicitly
outlining all the steps helps the model find the logical thread.
I bought 20 pens, gave 12 pens to my son, 3 to my daughter, 1 to my mom. I then bought 4 more
pens and lost 1.
Output the number of pens I have now.
Regarding logical problems, temperature and model version definitely matter. Lower temperature
will keep things more deterministic, while the latest models will have better accuracy, because they
have been trained on a larger dataset for logical problems.
With a temperature of 0.2, a GPT-3.5-turbo model can easily get this wrong, whereas a GPT-4 model
will probably say, “You currently have 7 pens.”

Note Older models would have probably responded with the number only, without any
reasoning. Newer models have been reinforced through step-by-step prompting.

In a typical wrong run, the answer is also incoherent: the model may state 8 pens while correctly
calculating 7 along the way.
Let’s try the following prompt to see if it provides more reasonable output:
I bought 20 pens, gave 12 pens to my son, 3 to my daughter, 1 to my mom. I then bought 4 more
pens and lost 1.
Output the number of pens I have now and make sure the answer is correct and coherent. Make sure
the answer and the explanation are correct and coherent before answering.
After buying 20 pens, I had 20 pens. After giving 12 pens to my son, I had 8 pens. After giving
3 pens to my daughter, I had 5 pens. After giving 1 pen to my mom, I had 4 pens. After buying 4
more pens, I had 8 pens. After losing 1 pen, I had 7 pens.
Therefore, I have 7 pens now.
That’s correct.
I bought 20 pens, gave 12 pens to my son, 3 to my daughter, 1 to my mom. I then bought 4 more
pens and lost 1.
Output the number of pens I have now. Let's think it step by step.
This is called a zero-shot chain-of-thought prompt; the magic sentence, “Let’s think it step by step,”
forces the model to take time to think and to write down all the steps.
At this point, to get the final answer, you could ask the model to produce a structured output or
make one more API call with a simple prompt like, “Extract only the final answer from this text”:
I bought 20 pens, gave 12 pens to my son, 3 to my daughter, 1 to my mom. I then bought 4 more
pens and lost 1.
Output the number of pens I have now. Let's think it step by step. Output a json with:
explanation (string) and result (int).
{"explanation":"Initially, I had 20 pens. After giving 12 to my son, I had 8 left. Then, I gave
3 to my daughter, leaving me with 5. Giving 1 to my mom left me with 4 pens. Buying 4 more pens
gave me a total of 8 pens. Unfortunately, I lost 1 pen, leaving me with a final total of 7
pens.","result":7}
Possible extensions
Combining the few-shot technique with the chain-of-thought approach can give the model some
examples of step-by-step reasoning to emulate. This is called few-shot chain-of-thought. For instance:
Which is the more convenient way to reach the destination, balancing costs and time?
Option 1: Take a 20-minute walk, then a 15-minute bus ride (2 dollars), and finally a 5-minute
taxi ride (15 dollars).
Option 2: Take a 30-minute bike ride, then a 10-minute subway ride (2 dollars), and finally a
5-minute walk.
An extension of this basic prompting technique is Auto-CoT. This basically leverages the few-shot
CoT approach, using a prompt to generate more samples (shots) of reasoning, which are then concat-
enated into a final prompt. Essentially, the idea is to auto-generate a few-shot CoT prompt.
Another extension is the tree-of-thoughts approach. A more sophisticated implementation of it
requires writing some more code, with different prompts running (maybe also with different
temperatures) and producing reasoning paths. These paths are then evaluated by another model
instance with a scoring/voting prompt, which excludes wrong ones. At the end, a certain mechanism
votes (for coherence or majority) for the correct answer.
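The following is a minimal sketch of that flow, not a production implementation. It reuses the Azure.AI.OpenAI client and the placeholder environment variables from the setup section, generates a few reasoning paths at different temperatures, and then asks a model instance at temperature 0 to vote for the most coherent one:

using System;
using System.Collections.Generic;
using Azure;
using Azure.AI.OpenAI;

var client = new OpenAIClient(
    new Uri(Environment.GetEnvironmentVariable("AOAI_ENDPOINT")),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AOAI_KEY")));
var deployment = Environment.GetEnvironmentVariable("AOAI_chat_DEPLOYMENTID");

var question = "I bought 20 pens, gave 12 to my son, 3 to my daughter, 1 to my mom. " +
               "I then bought 4 more pens and lost 1. How many pens do I have now?";

// 1. Generate several reasoning paths with different temperatures.
var temperatures = new[] { 0.2f, 0.7f, 1.1f };
var paths = new List<string>();
foreach (var t in temperatures)
{
    var options = new ChatCompletionsOptions
    {
        DeploymentName = deployment,
        Temperature = t,
        MaxTokens = 300
    };
    options.Messages.Add(new ChatRequestUserMessage(
        question + " Let's think it step by step."));
    var response = await client.GetChatCompletionsAsync(options);
    paths.Add(response.Value.Choices[0].Message.Content);
}

// 2. Ask another model instance to vote for the most coherent path.
var votingOptions = new ChatCompletionsOptions
{
    DeploymentName = deployment,
    Temperature = 0f,
    MaxTokens = 200
};
votingOptions.Messages.Add(new ChatRequestUserMessage(
    "Here are some candidate step-by-step solutions, separated by '---'. " +
    "Pick the most coherent one and report only its final answer.\n" +
    string.Join("\n---\n", paths)));

var vote = await client.GetChatCompletionsAsync(votingOptions);
Console.WriteLine(vote.Value.Choices[0].Message.Content);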
A few more emerging but relatively easy-to-implement prompting techniques are analogical
prompting (by Google DeepMind), which asks the model to recall a similar problem before solving
the current one; and step-back prompting, which prompts the model to step back from the specific
instance and contemplate the general principle at hand.
Chatbots
Chatbots have been around for years, but until the advent of the latest language models, they were
mostly perceived as a waste of time by users who had to interact with them. However, these new
models are now capable of understanding even when the user makes mistakes or writes poorly, and
they respond coherently to the assigned task. Previously, the thought of people who used chatbots was
almost always, “Let me talk to a human; this bot doesn’t understand.” Soon, however, I expect we will
reach something like the opposite: “Let me talk to a chatbot; this human doesn’t understand.”
System messages
With chatbots, system messages, also known as metaprompts, can be used to guide the model’s behav-
ior. A metaprompt defines the general guidelines to be followed. Still, while using these templates and
guidelines, it remains essential to validate the responses generated by the models.
A good system prompt should define the model’s profile, capabilities, and limitations for the specific
scenario. This involves:
■■ Specifying how the model should complete tasks and whether it can use additional tools
■■ Determining the desired posture and tone for the model’s responses
■■ Defining the output format, including language, syntax, and any formatting preferences
■■ Providing examples to demonstrate the model’s intended behavior, considering difficult use
cases and CoT reasoning
Collecting information
Suppose you want to build a booking chatbot for a hotel brand group. A reasonable system prompt
might look something like this:
You are a HotelBot, an automated service to collect hotel bookings within a hotel brand group,
in different cities.
You first greet the customer, then collect the booking, asking the name of the customer, the
city the customer wants to book, room type and additional services.
You wait to collect the entire booking, then summarize it and check for a final time if the
customer wants to add anything else.
You ask for arrival date, departure date, and calculate the number of nights. You ask for a
passport number. Make sure to clarify all options and extras to uniquely identify the item from
the pricing list.
You respond in a short, very conversational friendly style. Available cities: Rome, Lisbon,
Bucharest.
Extra services:
parking 20.00 per day,
late checkout 100.00
airport transfer 50.00
SPA 30.00 per day
Consider that the previous prompt is only a piece of a broader application. After the system mes-
sage is launched, the application should ask the user to start an interaction; then, a proper conversation
between the user and chatbot should begin.
For a console application, this is the basic code to incorporate to start such an interaction:
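A minimal sketch of such a console loop, assuming the Azure.AI.OpenAI client, the placeholder environment variables from the setup section, and the HotelBot system prompt above stored in a systemMessage variable, might look like this:

using System;
using Azure;
using Azure.AI.OpenAI;

var client = new OpenAIClient(
    new Uri(Environment.GetEnvironmentVariable("AOAI_ENDPOINT")),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AOAI_KEY")));

var options = new ChatCompletionsOptions
{
    DeploymentName = Environment.GetEnvironmentVariable("AOAI_chat_DEPLOYMENTID"),
    Temperature = 0.7f,
    MaxTokens = 500
};

// systemMessage holds the HotelBot prompt shown above.
var systemMessage = "You are a HotelBot, an automated service to collect hotel bookings...";
options.Messages.Add(new ChatRequestSystemMessage(systemMessage));

Console.WriteLine("HotelBot ready. Type 'quit' to exit.");
while (true)
{
    Console.Write("You: ");
    var userInput = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(userInput) || userInput.Trim().ToLower() == "quit")
        break;

    // Append the user turn, call the model, and append the reply so the
    // conversation history keeps growing across turns.
    options.Messages.Add(new ChatRequestUserMessage(userInput));
    var response = await client.GetChatCompletionsAsync(options);
    var reply = response.Value.Choices[0].Message.Content;
    options.Messages.Add(new ChatRequestAssistantMessage(reply));

    Console.WriteLine($"HotelBot: {reply}");
}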
Note When dealing with web apps, you must also consider the UI of the chat.
Once the conversation is over, the application can send a final prompt to obtain a structured
summary of the booking:

Return a json summary of the previous booking. Itemize the price for each item.
The json fields should be
1) name,
2) passport,
3) city,
4) room type with total price,
5) list of extras including total price,
6) arrival date,
7) departure date,
8) total days
9) total price of rooms and extras (calculated as the sum of the total room price and extra
price).
Return only the json, without introduction or final sentences.
Simulating a conversation with the HotelBot, a json like the following would be generated from
the previous prompt:

{"name":"Francesco Esposito","passport":"XXCONTOSO123","city":"Lisbon","room_type":{"single":150.00},
"extras":{"parking":{"price_per_day":20.00,"total_price":40.00}},"arrival_date":"2023-06-28",
"departure_date":"2023-06-30","total_days":2,"total_price":340.00}
Return a text summary from the following json, using a friendly style. Write at most two
sentences.
{"name":"Francesco Esposito","passport":"XXCONTOSO123","city":"Lisbon","room_type":{"single":150.
00},"extras":{"parking":{"price_per_day":20.00,"total_price":40.00}},"arrival_date":"2023-06-28",
"departure_date":"2023-06-30","total_days":2,"total_price":340.00}
Francesco Esposito will be staying in Lisbon from June 28th to June 30th. He has booked a single
room for $150.00 per night, and the total price including parking is $340.00 for 2 days.
Translating
Thanks to pretraining, one task that LLMs excel at is translating from a multitude of different languages—
not just natural human languages, but also programming languages.
For instance, a prompt can lay out a database schema, describe the data to retrieve, and end with
the single cue word SELECT.

This prompt is a classic example of a plain completion (so, Completion API). The last part (SELECT)
acts as a cue, which is the jumpstart for the output.
In a broader sense, within the context of Chat Completion API, the system prompt could involve
providing the database schema and asking the user which information to extract, which can then be
translated into an SQL query. This type of prompt generates a query that the user should execute on
the database only after assessing the risks. There are other tools to interact directly with the database
through agents using the LangChain framework, discussed later in this book. These tools, of course,
come with risks; they provide direct access to the data layer and should be evaluated on a case-by-case
basis.
A curious use case is that of a universal translator: a middleware that lets two users chat, each
in their own language. The flow goes something like this: one user sends a message to the other.
Suppose User1 writes a message {message1} in {user1Language}. The middleware asks the model to
translate the text delimited by <<<{message1}>>> into User2's language and forwards the result.
When User2 replies, the middleware performs the same job in the opposite direction and sends the
message to User1. And so on.
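The translation step that the middleware performs for each message can be sketched as a small helper like the one below. Names and prompt wording are illustrative, not taken from the original; the client and environment variables are the usual placeholders:

using System;
using System.Threading.Tasks;
using Azure;
using Azure.AI.OpenAI;

var client = new OpenAIClient(
    new Uri(Environment.GetEnvironmentVariable("AOAI_ENDPOINT")),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AOAI_KEY")));
var deployment = Environment.GetEnvironmentVariable("AOAI_chat_DEPLOYMENTID");

async Task<string> TranslateAsync(string message, string fromLanguage, string toLanguage)
{
    var options = new ChatCompletionsOptions
    {
        DeploymentName = deployment,
        Temperature = 0f,
        MaxTokens = 500
    };
    options.Messages.Add(new ChatRequestSystemMessage(
        "You are a translation middleware. Translate the user message, delimited by " +
        "<<< and >>>, from " + fromLanguage + " to " + toLanguage + ". " +
        "Return only the translated text."));
    options.Messages.Add(new ChatRequestUserMessage($"<<<{message}>>>"));

    var response = await client.GetChatCompletionsAsync(options);
    return response.Value.Choices[0].Message.Content;
}

// User1 writes in Italian; the middleware relays it to User2 in English.
var translated = await TranslateAsync("Vorrei prenotare una camera per due notti.", "Italian", "English");
Console.WriteLine(translated);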
LLM limitations
So far, this chapter has focused on the positive aspects of LLMs. But LLMs have limitations in several
areas:
■■ LLMs struggle with accurate source citations due to their lack of internet access and limited
memory. Consequently, they may generate sources that appear reliable but are incorrect (this is
called hallucination). Strategies like search-augmented LLMs can help address this issue.
■■ LLMs tend to produce biased responses, occasionally exhibiting sexist, racist, or homophobic
language, even with safeguards in place. Care should be taken when using LLMs in consumer-
facing applications and research to avoid biased results.
■■ LLMs often generate false information when faced with questions on which they have not been
trained, confidently providing incorrect answers or hallucinating responses.
■■ Without additional prompting strategies, LLMs generally perform poorly in math, struggling
with both simple and complex math problems.
Summary
This chapter explored various basic aspects of prompt engineering in the context of LLMs. It covered
common practices and alternative methods for altering output, including playing with hyperparam-
eters. In addition, it discussed accessing OpenAI APIs and setting things up in C# and Python.
Next, the chapter delved into basic prompting techniques, including zero-shot and few-shot
scenarios, iterative refining, chain-of-thought, time to think, and possible extensions. It also examined
basic use cases such as booking chatbots for collecting information, summarization, and transforma-
tion, along with the concept of a universal translator.
Finally, the chapter discussed limitations of LLMs, including generating incorrect citations,
producing biased responses, returning false information, and performing poorly in math.