Phi 2, developed by Microsoft, is making waves in the world of LLMs. Unlike its larger counterparts, Phi 2 is intentionally designed to be compact, with "only" 2.7 billion parameters, yet it manages to hold its own in various language model evaluations. But can it be effectively used for Retrieval-Augmented Generation (RAG)? Are there any trade-offs to consider?
In this experiment, I took Pico Jarvis, a simplified RAG implementation comprising just a few hundred lines of code, and made it compatible with Phi 2. Keep in mind that this isn't a production system; it is an illustrative example, not a groundbreaking scientific discovery. The experiment's verdict? Yes, Phi 2 shows promise as a language model for RAG, with some necessary adjustments. If you're not already familiar with it, check out the backstory of Pico Jarvis in Parts 1, 2, and 3 to understand its basic features and capabilities.
Initially, Pico Jarvis was designed to work with Mistral, specifically the OpenOrca variant, which proved effective for RAG, even when running in a 4-bit quantized mode. The prospect of transitioning to Phi 2 is intriguing because it could significantly enhance performance, potentially enabling the entire system to run smoothly on more modest hardware, rather than a dedicated rig with a heavyweight GPU.
Honestly, I've been eager to integrate a "small" language model with Pico Jarvis for a while. There was Orca Mini 3B, based on OpenLLaMA 3B and fine-tuned using the dataset-construction approach from the Orca research paper, which showed promise but fell short when used for RAG. Phi 2, however, has broken that barrier: RAG at roughly the 3B scale finally works. I believe we'll see more open-source Small Language Models (SLMs) fine-tuned and customized for RAG in the future.
From Mistral to Phi
First, let's compare the speed of 5-bit Mistral and Phi 2 on a MacBook Air M2 (8 GB). Notably, it wasn't the latest MacBook model. The chart also provides a reference for the speed on a Ryzen 5 4560GE, an even older (and weaker) mobile CPU. Yes, running Mistral on that ancient Ryzen processor is not for the faint of heart.
By the way, I chose 5-bit quantization for this comparison because I found that 4-bit quantization of Phi 2 didn't quite work with the kind of RAG implemented in Pico Jarvis. It might be possible to make 4-bit work with some prompt tweaks, but for now, I'm sticking with Q5_K_M. For reference, the weight file sizes for Phi 2 and Mistral are 1.9 GB and 4.8 GB, respectively.
Speaking of prompt tweaks, that was the first step in getting RAG to work with Phi 2. Initially, the prompts were tailored for Mistral. As you can see in this commit, which is also summarized below, Phi 2 requires a more assertive prompt variant to ensure it follows instructions more faithfully.
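To give a flavor of what "more assertive" means, here is a rough before-and-after sketch in JavaScript. This is an illustration only, not the exact prompts from that commit, and the `{context}` and `{question}` placeholders are hypothetical stand-ins for what Pico Jarvis fills in.

```js
// Illustration only — see the linked commit for the actual prompts.

// A relaxed, Mistral-oriented instruction:
const mistralPrompt = `Use the following context to answer the question.
Context: {context}
Question: {question}`;

// Phi 2 tends to drift, so the instruction is made far more assertive:
const phi2Prompt = `You MUST answer using ONLY the context below.
Reply with one short sentence and nothing else. Do NOT add commentary.
Context: {context}
Question: {question}
Answer:`;
```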
This situation highlights the challenge of working with prompt-driven LLM applications. Crafting an effective prompt, a.k.a. "prompt engineering", can be a finicky process, especially when you need to support several different LLMs as back-ends.
The brittleness of prompt engineering led me to refactor the Reason-Act pattern implementation in Pico Jarvis. For more details on that, check out my previous article. Converting free-form text generated by an LLM into a semi-structured format can be quite messy, since the output can deviate from the expected shape in many ways, so some error handling is inevitable.
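To illustrate that messiness, here is a minimal JavaScript sketch (not the actual Pico Jarvis code) of parsing a Reason-Act style response, with a fallback for when the model ignores the expected labels:

```js
// Minimal sketch, not the real Pico Jarvis implementation.
// Expects output roughly like:
//   Thought: I need the boiling point of nitrogen.
//   Action: lookup: boiling point of nitrogen
// but degrades gracefully when the model goes off-script.
function parseReasonAct(text) {
  const result = { thought: null, action: null, answer: null };
  for (const line of text.split('\n')) {
    const match = line.match(/^(Thought|Action|Answer):\s*(.*)$/i);
    if (match) result[match[1].toLowerCase()] = match[2].trim();
  }
  // Non-standard path: no labels at all, so treat the whole text as the answer.
  if (!result.thought && !result.action && !result.answer) {
    result.answer = text.trim();
  }
  return result;
}
```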
I have a hunch that in the future, any intermediate response from an LLM intended for further processing, rather than a final user answer, will need to be in a structured format (JSON being the likely choice). Combined with the prompt challenges, it's no surprise that LLM apps will increasingly be constructed as pipelines. This idea has gained traction in various LLM frameworks these days. For a deeper dive, I recommend checking out DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines by Omar Khattab and the Stanford NLP team (GitHub repository: stanfordnlp/dspy).
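As a thought experiment, such an intermediate step could look like the following hypothetical sketch (this is not something Pico Jarvis does today): ask the model for JSON, and keep the raw text as a fallback when parsing fails.

```js
// Hypothetical sketch: request JSON for an intermediate step and
// hand the raw text to error handling when parsing fails.
async function nextStep(llm, question, context) {
  const prompt = `Respond with JSON only, e.g. {"tool": "lookup", "input": "..."}.
Question: ${question}
Context: ${context}`;
  const raw = await llm(prompt);
  try {
    return JSON.parse(raw);
  } catch {
    // The model ignored the instruction; preserve the raw text instead.
    return { tool: null, input: raw.trim() };
  }
}
```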
After addressing these two challenges—tweaking prompts and refactoring Reason-Act—Pico Jarvis can finally work seamlessly with Phi 2. This also means it's entirely possible to demonstrate Pico Jarvis on an average MacBook Air M2. A typical question that used to take 12 seconds with Mistral now takes only 8 seconds with Phi 2. Is it the fastest thing in the world? Certainly not. Is it usable to a reasonable extent? Absolutely. It means everything works locally, making it enjoyable even when you're offline.
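For reference, Pico Jarvis talks to a llama.cpp server running locally, so a back-of-the-envelope latency check looks roughly like the sketch below. It assumes the server is already loaded with the Phi 2 GGUF, listens on port 8080, and exposes llama.cpp's usual /completion endpoint.

```js
// Rough latency check against a local llama.cpp server (assumed on port 8080).
async function ask(prompt) {
  const start = Date.now();
  const response = await fetch('http://localhost:8080/completion', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, n_predict: 128, temperature: 0 })
  });
  const { content } = await response.json();
  console.log(`Answered in ${((Date.now() - start) / 1000).toFixed(1)}s`);
  return content;
}
```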
Does Size Matter?
To be fair, there are some downsides.
Firstly, there's a slight drop in accuracy. I haven't had the chance to conduct a thorough evaluation yet. Occasionally, Pico Jarvis provides incorrect answers when the conversation history becomes convoluted, possibly due to the lengthening prompt. My current workaround is to clear the history, and I plan to experiment with some prompt compression tricks.
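A crude version of that workaround is simply to cap how many past exchanges make it back into the prompt, along these lines (the three-turn limit is an arbitrary choice for illustration; a real prompt-compression scheme would summarize rather than drop turns):

```js
// Naive sketch: keep only the last few exchanges so the prompt stays short.
const MAX_TURNS = 3; // arbitrary limit for illustration
function compactHistory(history) {
  return history.slice(-MAX_TURNS);
}
```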
Another drawback is the loss of multilingual capabilities, especially for low-resource languages like Bahasa Indonesia. While Mistral itself was primarily optimized for English and other widely-used Western languages, it could still handle some Bahasa Indonesia questions reasonably well. On the other hand, Phi 2 struggles with Indonesian, quickly losing coherence in conversations.
While there may be more disadvantages to using a small LLM like Phi 2, these are the two glaring ones for now. As I continue to tinker and explore Phi 2, I'll be happy to share any significant shortcomings and potential workarounds.
Is Phi 2 my top choice for RAG right now? Probably not, unless you're operating in a resource-constrained environment (e.g., a mobile device with limited processing power). However, it's hard to ignore the fact that small language models are becoming increasingly sophisticated. In the near future, I'm confident that a hybrid approach will emerge: a fast, local SLM for quick reasoning, coupled with a remote, more capable LLM for refining responses progressively.
Stay tuned for an exciting 2024!