This is part 3 of our Domain-Specific AI series. This post features a step-by-step guide to building an AI engine that generates the content you need, with the style you want. It's easy, simple, and flexible, allowing you to setup a content generation pipeline in minutes.
TL;DR
You can develop a high-quality AI pipeline for domain-specific tasks quickly and easily.
Leveraging LLMs' in-context learning abilities means that little training data is required.
Utilizing embedding models and indexing, you have control over the structure, tone, and style of outputs.
And all of this can be achieved with only minutes of work.
Table of Contents
In part I and part II of this series, we reviewed the various options for adapting LLMs for domain-specific use cases. We discussed in-context learning and fine-tuning to understand how these methods work and what are the considerations for choosing one over the other.
In this post, we will dive into the concept of in-context learning. To keep things simple, in-context learning exemplifies the desired behavior from the model. It requires minimal training data for demonstration and avoids any model training complexities, making it a resource-efficient option. By the end of this post, you will have the skills to set up a complete content generation pipeline, and you can also find a fully functional Colab notebook with all the code discussed in this post for your convenience.
We'll be covering the following topics:
How to index and discover the right content examples
Calling the ChatGPI API to generate new content
Evaluating various strategies and their impact on the generated content
It's important to note that while we will predominantly use the ChatGPT API in this post, you have the flexibility to employ any commercial API, such as Anthropic, AI21, or even open-source models. The specific configuration is entirely up to your discretion.
Getting Started
This tutorial is centered around the creation of high-quality content in various contexts. Potential applications include summarizing documents with specific attributes, crafting marketing emails, and composing compelling advertisements. Our primary focus will be on the task of generating feature-bullets for Amazon product pages.
Feature-bullets are a brief summary of the product’s features presented in a scannable manner. We emphasize feature-bullets over product descriptions because they appear prominently at the top of the product page, immediately following the product title and images. This stands in contrast to the product description, which receives less initial attention as it's positioned in the lower third of the product page. Additionally, the keywords used in feature-bullets are indexed by the Amazon search algorithm, increasing discoverability for potential shoppers.
A quick note on the terminology used in this post: we focus on the concept of few-shot learning, also referred to as in-context learning. It's important to emphasize that there is no pre-training or fine-tuning of models in this process. Whenever we mention "training data", it refers to the prompt-response pairs of data that we feed to the model.
With that out of the way, it's time to build a content generation platform 🤓✍️
Data is Everything
The most critical component in our content generation platform is the data. This holds true in machine learning in general, and Language Model Models (LLMs) are no different. While modern commercial LLMs are highly capable and require relatively little training data, the quality of that data is paramount. Therefore, it's imperative to invest time and effort in constructing a high-quality training dataset, which will serve as the foundation for in-context prompting of the model. This means you should verify that the data aligns with your requirements in terms of style, length, and the specific attributes you desire in your generated outputs.
In the specific use-case demonstrated in this post, our goal is to create Amazon feature-bullets based solely on a short title and technical data. To ensure that the AI-generated content mirrors the desired structure and style, we rely on a carefully curated training dataset sourced from leading Amazon product pages. Although the dataset we utilize comprises approximately 200 training examples, in reality, this is more than sufficient. You'll find that only a handful of training samples are necessary for achieving your desired style and tone.
Loading the Data
The dataset we use is hosted on the 🤗 Hugging Face Hub, and can be easily loaded using their datasets library. As the initial step, you'll load both the training (few-shot) and test data, and then convert it into a pandas dataframe for smoother data handling.
Fetching the Right Data
To generate responses from prompts, we employ a few-shot context framework. There are various strategies to consider when choosing the examples to feed into the prompts. For instance, if we believe that the training set is homogeneous and its distribution resembles the prompts used during inference, we can opt to randomly select prompt-response pairs as examples. Alternatively, if the training dataset is categorized or tagged, we can choose examples based on these categories. In the use-case demonstrated in this post, each product falls under a specific category (e.g., smartwatches, office supplies), making it sensible to use few-shot data from products in the same category as the target product. Nevertheless, there are also more structured methods available.
Embedding and Indexing
A more advanced approach involves selecting training data that closely resembles the target data. In statistical terms, we aim for the training and inference data to have similar distributions. The practical implementation of this approach may vary depending on the specific scenarios. In our case, we seek training product titles that are comparable to the title of the target product. But how is this achieved in practice?
To achieve this, we leverage the powerful concept of data embedding. In a nutshell, embedding refers to the process of transforming various types of data (text, image, voice, and more) into vector representations. A well-performing embedding model is expected to to position similar data items in close proximity within the vector space, while keeping dissimilar items at a distance. For example, when we utilize an embedding model to map individual words, we expect "moon" and "lunar" to be situated closely, whereas the word "table" would be positioned farther away. In the context of the product titles we discussed earlier, LLMs enable us to embed complete sentences (and even entire documents, and more). If you learn more about embeddings, you can find more information here and here.
Once we've embedded the product titles from the training data, our next objective is to efficiently search for and retrieve this information. This is where vector stores come in handy. There are many libraries for this purpose, offering a wide range of structures, algorithms and more. Given our focus, we can avoid using low-level APIs, and we can leverage NLP libraries like Llamaindex and LangChain, which offer a more user-friendly API. These libraries also seamlessly bundle the embedding and indexing steps together. While Llamaindex offers enhanced flexibility and may be better suited for large-scale projects, in this tutorial we'll be using LangChain, with FAISS serving as the vector store backend.
Notice that product titles have been indexed using a Max Inner Product strategy. This choice aligns with the customary practice in NLP tasks of using cosine distance, which measures the angle between vectors rather than their Euclidean distance. The output from the OpenAI embedding API consists of normalized vectors, and in such cases, the inner product and cosine distance of vectors coalesces. One advantage of cosine distance is that it falls within the range of [-1, 1], where a value of 1 indicates identical vectors, 0 represents perpendicular vectors, and a distance of -1 signifies opposite vectors. In contrast, Euclidean distance can take on any non-negative value and is therefore not a "standardized" metric.
Data Retrieval
With the embedding and indexing phases completed, it's time to explore the functionality of our retrieval system. We begin by selecting a product title from the test set, embedding it, and conducting a search to identify the five most similar titles.
The product title we sampled is:
WeMo Smart Video Doorbell - Apple HomeKit Secure Video with HDR - Smart Home Products Video Doorbell Camera - Ring Doorbell for Security Camera System - WiFi Camera Doorbell w/ 223° FOV & 2-Way Audio
With the five most similar titles being:
Title: Brilliant Smart Home Control (3-Switch Panel) — Alexa Built-In & Compatible with Ring, Sonos, Hue, Google Nest...
Category: smart_home_products
Score: 0.8510
Title: Brilliant Smart Dimmer Switch (Light Almond) — Compatible with Alexa, Google Assistant, Apple HomeKit, Hue, LI...
Category: smart_home_products
Score: 0.8283
Title: Ousmile Smart Sunrise Hatch Alarm Clock, Fast Wireless Charger Intelligent Atmosphere Lamp, RGB Night Light Bl...
Category: smart_home_products
Score: 0.8276
Title: BWLLNI Lighted Vanity Mirror with Lights, Makeup Mirror with Storage Shelves, Vanity Mirror with Lights 12 Dim...
Category: smart_home_products
Score: 0.8094
Title: HAPPRUN Projector, 5G WiFi Bluetooth Projector, Native 1080P Portable Projector with Screen and Bag, Support 4...
Category: projector_mounts
Score: 0.8083
We observe an interesting outcome. The target product title belongs to the "smart_home_products" category, and remarkably, the top four results also fall within the same category. This is noteworthy considering that our training set encompassed a total of 14 distinct categories. This outcome illustrates the effectiveness of using embeddings in our retrieval process. Moreover, the target product title and the first three results share a common structure: `<BRAND NAME> Smart <PRODUCT NAME>`.
While the current results are undoubtedly relevant, there's a potential drawback in that they might appear overly homogeneous. To address this, we'll introduce another strategy for data retrieval known as Maximal Marginal Relevance (MMR).
But how does MMR work, and how does it differ from simply retrieving the most similar titles? Let's consider a scenario where we aim to retrieve K results, with K=5 in our example. In the initial step, the algorithm fetches K_init > K items, which represents the most similar results to the target. Subsequently, within this group of K_init items, we refine the selection based on their similarity to the single most similar item. This refinement is designed to strike a balance between relevance and diversity. Here's how you can implement MMR using LangChain:
Which yields the following result:
Title: Brilliant Smart Home Control (3-Switch Panel) — Al...
Category: smart_home_products
Score: 0.8510
Title: HDMI Video Capture Card, 4K HDMI to USB Capture Ca...
Category: computer_input_devices
Score: 0.8047
Title: VAULTEK Smart Station Home Centric Biometric Smart...
Category: smart_home_products
Score: 0.8006
Title: EIGIIS Smart Watch for Women 1.7" HD Waterproof Sm...
Category: smartwatches
Score: 0.7990
Title: Bone Conduction Speaker, True Wireless Speakers Mi...
Category: speakers
Score: 0.7952
Using MMR indeed diversified the results. Now we got only two items from the "smart_home_products" category, while still maintaining relevancy.
In the next section, we'll harness these product titles, technical details, and feature-bullets from the training set to generate fresh product features, all conditioned on their respective titles and technical details.
Content Generation Using ChatGPT API
Up to this point, we have developed a retrieval system on a small scale. Now, it's time to utilize the ChatGPT API for the actual content generation. The model's primary input is a conversation, organized as an array of messages, and its output is a model-generated response. Each message object is assigned a role, either "system," "user," or "assistant," and includes content. A conversation begins with an initial system message, followed by a sequence of alternating user and assistant messages. The model generates an assistant response in response to a user message. For effective few-shot prompting, it's best practice to include an assistant response following each user message.
Helper Functions
In order to efficiently perform this step, we will first define some functions for easier prompting.
To initiate a conversation, we utilize the `Conversation` class, which features an `add_message()` method. This method serves the purpose of appending user and assistant messages to the conversation. Additionally, the `display_conversation()` method is available to print conversations in a display-friendly format.
The `api_call()` function serves as a simple wrapper for OpenAI's chat completions API, simplifying the process of invoking it. Lastly, `USER_TXT` is a straightforward string that is formatted with the product title and technical details, providing a suitable prompt format for this specific content generation task.
Zero-Shot Example
We begin by testing an example using zero-shot prompting, which means providing instructions to the model without including examples that demonstrate the desired output structure.
To call the API, we use the `api_call()` function defined above. Here is the text generation result we got:
- Apple HomeKit Secure Video integration for enhanced security and privacy
- High Dynamic Range (HDR) technology for clear, detailed video footage
- Works seamlessly with other smart home products for a complete security camera system
- Wide 223° field of view (FOV) ensures full coverage of your doorstep and surroundings
- Two-way audio allows for easy communication with visitors or delivery personnel
- Easy wireless or wired connectivity options for flexible installation
- HD resolution provides crisp and sharp video quality
- Built-in night vision for clear image capture in low-light conditions
- Advanced image sensor for accurate motion detection and reliable alerts
We observe that the model performs well in structuring the details into a feature-bullet format, in line with its instructions. There's no denying that ChatGPT is a robust and capable model. However, the generated output falls short of meeting the standards required for an Amazon product page. The bullet points are highly technical, lack appeal, and fail to convey an engaging message that could attract potential buyers. To enhance the quality of the output, we'll need to offer the model additional context.
Few-Shot Example
State-of-the-art LLMs exhibit exceptional capabilities when it comes to the practice of few-shot prompting. In fact, OpenAI's announcement of the GPT-3 model was titled "Language Models are Few-Shot Learners". By offering examples to the model, users can gain influence over various aspects of the response, including style, tone, structure, and other features.
Let's revisit the example we used earlier, but this time, we'll incorporate few-shot data into the prompt. To assemble a suitable few-shot dataset, we'll use the retrieval system introduced earlier.
With a 2-shot prompting approach, here is the result we received from the model:
- APPLE HOMEKIT SECURE VIDEO: Connect your WeMo Smart Video Doorbell to Apple HomeKit Secure Video for enhanced security and convenience. View live video footage, receive notifications, and securely store recordings in iCloud.
- HIGH DEFINITION VIDEO & HDR: Capture clear and detailed video with the HD resolution and HDR (High Dynamic Range) technology. See every detail, even in challenging lighting conditions.
- 2-WAY AUDIO COMMUNICATION: Speak to visitors at your door with the built-in 2-way audio feature. Communicate with delivery personnel, guests, or potential intruders from anywhere using your smartphone or tablet.
- WIDE FIELD OF VIEW: The WeMo Smart Video Doorbell offers a wide 223° field of view, ensuring you can see a large area around your front door. Monitor your doorstep, driveway, and surrounding areas with ease.
- NIGHT VISION: Keep an eye on your front door even at night with the built-in night vision capability. See clear and detailed images, even in low-light conditions, for enhanced security and peace of mind.
- EASY INSTALLATION & SMART CONNECTIVITY: The WeMo Smart Video Doorbell can be easily installed with either wireless or wired connectivity options. Connect to your home's Wi-Fi network and start monitoring your front door from anywhere using the WeMo app.
And there you have it. With just two training examples from our dataset, the model has grasped the desired structure and style. The bullets are now correctly formatted with a capitalized prefix, and the output is not only well-structured but also engaging, offering more than just technical details.
In the next section, we'll analyze how this output compares to the original feature-bullets found on the Amazon product page.
Analysis
In this section, we will examine two outputs. Initially, we will compare a model-generated output to the original one, allowing us to better assess the quality of the results. Following that, to showcase the impact of few-shot context on the model's output, we will generate feature-bullets for the same target product, albeit with different context.
Original Bullets vs. Model Output
The table below compares the topics that appear in the original feature-bullets to those output by the ChatGPT API:
Topic | ChatGPT Output | Original Features |
---|---|---|
Compatibility with Apple HomeKit | ✓ | ✓ |
Video quality | ✓ | ✓ |
Video angle | ✓ | ✓ |
Night vision | ✓ | ✓ |
Audio communication | ✓ | ✗ |
WiFi connectivity | ✓ | ✓ |
Ease of installation | ✓ | ✗ |
We can observe that the output generated by ChatGPT covers all the topics included in the original feature-bullets on Amazon.com. It also introduces 'Audio Communication' and 'Installation' to the feature-bullets, both of which are absent in the original feature-bullets. Furthermore, ChatGPT's output places greater emphasis on video quality, whereas in the original features, this aspect is only briefly mentioned. On the other hand, the original bullets provide more extensive information about WiFi connectivity. Overall, ChatGPT's output outperforms the original one by highlighting more appealing features, and the overall text quality is superior.
Few-Shot Data Affect
The example we observed earlier utilized a 2-shot prompt, in which the context data was specifically chosen to be relevant to the target product. To emphasize the significance of the context we include during prompting, we revisit the same target product. However, this time, we deliberately incorporate a 2-shot context from the 'Office Supply' category, where the feature-bullets are short and concise. The resulting output from the API is as follows:
- SMART VIDEO DOORBELL: Allows you to monitor your front door from anywhere using your smartphone or tablet.
- APPLE HOMEKIT SECURE VIDEO: Enjoy the added security and convenience of Apple HomeKit compatibility, allowing you to integrate your doorbell with other smart home devices.
- HDR TECHNOLOGY: Provides clear and vibrant video footage, even in challenging lighting conditions.
- WIDE FIELD OF VIEW: Capture a wide-angle view of your front porch with the 223° field of view, ensuring you never miss a visitor.
- TWO-WAY AUDIO: Communicate with visitors in real-time using the built-in microphone and speaker, perfect for greeting guests or handling deliveries.
- NIGHT VISION: Feel secure at all times with the doorbell's night vision feature, allowing you to see and record activity even in low-light conditions.
- EASY INSTALLATION: Choose between wireless or wired connectivity options, making installation a breeze.
- INTEGRATED IMAGE SENSOR: The doorbell's image sensor ensures high-quality, detailed video footage.
- ENHANCED SECURITY: Keep your home and loved ones safe with the added security of a video doorbell system.
Interestingly, the model's output continues to cover similar topics as it did previously. However, lacking the appropriate context, we receive numerous concise bullets with less engaging content. In a sense, it bears resemblance to the zero-shot output, albeit now adopting a capitalized prefix structure and featuring somewhat improved content in the bullets.
Conclusion
This post has demonstrated the simplicity of creating a content generation system. All you really need is a small-scale, high-quality training dataset to serve as a reference, and the capabilities of LLMs will handle the rest for you.
While we've showcased a specific content use-case and built the system around the ChatGPT API, there's no necessity to confine yourself to these presets. To generate content for other use-cases, you only need to adjust the prompt template. Additionally, there's no need to restrict yourself to using OpenAI's services. With minor modifications, you can explore other providers or open-source solutions.
Lastly, it's essential to consider additional factors when designing such a system, including pricing, availability, and privacy. Open-source deployments that utilize smaller models (1B-3B parameters) as the backend offer significant advantages in these areas. While this backend design requires an additional step of fine-tuning a model, which this post avoided, the setup for similar contained tasks is not complex and offers substantial benefits in the long term. Stay tuned for our next post, where we'll provide a step-by-step guide on fine-tuning a small model.
Appendix - Data and Outputs
Sample product (Smart Video Doorbell) original feature-bullets
- COMPATIBLE WITH HOMEKIT SECURE VIDEO: With Apple HomeKit this smart video doorbell camera can send notifications to your Apple iPhone, helping you see who is at your door with face recognition. NOTE: Only compatible with a wired 16-24V AC doorbell system
- WIFI CAMERA WITH WIDE FIELD OF VIEW: Our home security camera has a super wide FOV measuring 178° vertical x 140° horizontal x 223° diagonal, so you’ll never miss a doorbell ring or nearby activity
- CLEAR PICTURE IN ANY LIGHTING: This smart home doorbell camera uses infrared technology to help get crisp video, even in the dark with low-light sensitivity and an HD camera
- DUAL WIFI BANDS BRING PEACE OF MIND: The 2.4GHz WiFi band offers a solid, long distance connection that easily penetrates walls, while the 5GHz wifi band offers greater speed while in closer range
- REVIEW VIDEO WITH EASE - With your existing iCloud storage plan and our HomeKit Secure Video enabled security camera, your 10-day motion-based recording history is securely stored and easily available to view whenever you need it
2-Shot context, affect check:
Title 1 | Early Buy Sticky Notes 6 Bright Color 6 Pads Self-Stick Notes 3 in x 3 in, 100 Sheets/Pad |
---|---|
Feature-bullets 1 | - 3 IN X: 3 in, 6 Pads / Pack, 100 Sheets / Pad, 6 colors, bright colors, easy to find out message what you write. - MEDIUM SIZE: easy to use, portable, bright color, making your message more noticeable, not easy to be ignored. - MADE WITH HIGH QUALITY PAPER: and adhesive, easy to use and peel, super sticky, removes cleanly. - USE FOR SCHOOL: office, family. Great for leaving notes or reminders on walls, doors, monitors, or other surfaces. |
Title 2 | Kinbashi 3 Ring Binder 1 Inch Binder for Letter Size Paper, Cute Binder Organizer for School Office Supplies, Black |
Feature-bullets 2 | - 1 INCH: 3 ring binder with unique design, simple and fashionable. It’s suitable for offices, students, teachers, school, and families. - MADE OF HIGH-QUALITY CARDBOARD: with full color printing, fashionable and durable. - EXPANDED SIZE: 21.9 x 12.4 inches, folded size: 12.4*10.2*1.5inches. - THIS 3 RING BINDER: comes with 2 interior pockets and includes 5 tab dividers and 18 labels stickers. - OUR 3 RING BINDER: is perfect for storing your draft papers, artworks, invoices, work orders and more. |
Kommentare