LLMs are Level 2 Automation
Full self-driving is a long way off, but there's a lot we can do today.
Large Language Models (LLMs) like ChatGPT are incredibly powerful and at times feel borderline magical, so it's easy to feel like they can do anything. Unfortunately, that's not the case. To use a self-driving car analogy, LLMs are currently more like Level 2 automation than full self-driving.
The Society of Automotive Engineers (SAE) defines six levels of driving automation, from 0 to 5. Levels 0 through 2 are human-augmented driving: you are still driving the car, but features like lane assist and adaptive cruise control help you. Augmented driving features are very convenient and make cars easier and safer to drive. Truly automated driving starts at Level 3 and above, when the car, rather than the driver, is in control under certain circumstances. Level 5 is what we think of as full self-driving.
LLMs like ChatGPT are not there yet. While they seem incredibly human-like at times, they can't handle end-to-end scenarios consistently. This results in very strange behavior when you engage with them for longer periods of time. Chatting with these systems reveals their weaknesses quickly because human conversations easily jump across multiple domains and intents. Right now, to have a truly human-like conversation, you still need a human on the other side.
When you limit the problem domain, though, LLMs perform much better. Prompt engineering and fine-tuning are great when you have a specific problem to solve. There is also prompt tuning, in which learned "soft prompts" can outperform the few-shot examples used in traditional prompts. With these techniques, you can improve accuracy within a specific problem space.
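To make that concrete, here is a minimal sketch of few-shot prompt engineering for one narrow task: classifying support tickets. The template, labels, and the `complete` function are all illustrative assumptions; `complete` stands in for whatever LLM completion API you happen to use.

```python
# Hypothetical example: a few-shot prompt for a single, narrow task.
FEW_SHOT_PROMPT = """Classify each support ticket as 'billing', 'bug', or 'other'.

Ticket: I was charged twice for my subscription this month.
Category: billing

Ticket: The export button crashes the app on Safari.
Category: bug

Ticket: {ticket}
Category:"""


def classify_ticket(ticket: str, complete) -> str:
    """Fill the ticket into the few-shot template and return the model's label.

    `complete` is a hypothetical stand-in for your LLM completion call.
    """
    prompt = FEW_SHOT_PROMPT.format(ticket=ticket)
    label = complete(prompt).strip().lower()
    # Constrain the open-ended completion to the labels we expect.
    return label if label in {"billing", "bug", "other"} else "other"
```

Notice how the prompt narrows the model's job to a single decision with a fixed set of answers. This is exactly the kind of constrained problem space where today's LLMs shine.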
Even with these techniques, current LLMs are not good enough to trust the output 100% of the time. They are still prone to hallucinations and can confidently produce factually incorrect responses, especially when the input provides insufficient information. This means some level of human review is necessary before the output can be fully trusted. That's why LLMs are great for tasks that a human can review first, such as creative writing applications, content generation, or human-in-the-loop automations.
The challenge in using LLMs effectively lies in how you structure your application around their limitations. A product using an LLM must specify its tasks precisely, and the rest of the input text needs to be clear. The model's output should then be validated (preferably by the user) before any high-stakes action is taken. For example, if you are using an LLM to help someone write an email, your application should be built with affordances that help the user tell the model what they expect, and the user needs a chance to edit or alter the text before the email gets sent.
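Here is a minimal sketch of that human-in-the-loop pattern, under the assumptions of the email example above. The `generate_draft` and `send_email` helpers are hypothetical stand-ins for your LLM call and mail integration.

```python
# Hypothetical example: the user reviews and approves the LLM's draft
# before the high-stakes action (sending the email) happens.
def compose_with_review(recipient: str, intent: str, generate_draft, send_email) -> None:
    # Give the model a precisely specified task rather than an open-ended one.
    prompt = f"Write a short, polite email to {recipient}. Goal: {intent}."
    draft = generate_draft(prompt)

    # Show the draft and let the user edit it before anything is sent.
    print(f"--- Draft ---\n{draft}\n-------------")
    edited = input("Edit the draft (or press Enter to keep it): ") or draft

    # The send only happens on explicit user approval.
    if input("Send this email? [y/N] ").strip().lower() == "y":
        send_email(recipient, edited)
    else:
        print("Draft discarded; nothing was sent.")
```

The key design choice is that the model never triggers the irreversible action on its own: the user stays in the loop at exactly the point where a mistake would be costly.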
LLMs are amazing and game-changing tools, but they're not perfect yet. They're well suited for Level 2 automation, where people are made much more efficient and creative but are still ultimately in control. In the future, maybe we'll reach something like full self-driving with LLMs, but there's a long road ahead. In the meantime, there are plenty of incredible things we can do right now, as long as we keep their limitations in mind.
William Cheng is an engineering and product leader, and co-founder of Maestro AI, an AI-powered chief of staff that automates knowledge management for dev teams. Connect with him on LinkedIn.