Emily Cerf, UC Santa Cruz
Large language models such as ChatGPT have proven able to produce remarkably intelligent results, but the energy and monetary costs of running these massive algorithms are sky high. Running ChatGPT 3.5 costs $700,000 per day in energy, according to recent estimates, and leaves behind a massive carbon footprint in the process.
In a new preprint paper, researchers from UC Santa Cruz show that it is possible to eliminate the most computationally expensive element of running large language models, called matrix multiplication, while maintaining performance. By getting rid of matrix multiplication and running their algorithm on custom hardware, the researchers found that they could power a billion-parameter-scale language model on just 13 watts, about as much as it takes to power a lightbulb and more than 50 times more efficient than typical hardware.
Even with a slimmed-down algorithm and much less energy consumption, the new, open source model achieves the same performance as state-of-the-art models like Meta’s Llama.
“We got the same performance at way less cost — all we had to do was fundamentally change how neural networks work,” said Jason Eshraghian, an assistant professor of electrical and computer engineering at the Baskin School of Engineering and the paper’s lead author. “Then we took it a step further and built custom hardware.”
Understanding the cost
Until now, all modern neural networks, the algorithms that power large language models, have used a technique called matrix multiplication. In large language models, words are represented as numbers that are then organized into matrices. Matrices are multiplied by each other to produce language, performing operations that weigh the importance of particular words or highlight relationships between words in a sentence or sentences in a paragraph. Larger-scale language models have trillions of these numbers.
“Neural networks, in a way, are glorified matrix multiplication machines,” Eshraghian said. “The larger your matrix, the more things your neural network can learn.”
For the algorithms to be able to multiply matrices together, the matrices need to be stored somewhere and then fetched when it comes time to compute. In practice, the matrices are stored across hundreds of physically separated graphics processing units (GPUs), specialized circuits built to carry out computations quickly on very large datasets and made by the likes of hardware giant Nvidia. To multiply numbers from matrices on different GPUs, data must be moved around, a process that creates most of the neural network’s costs in terms of time and energy.
Eliminating matrix multiplication
The researchers came up with a strategy to avoid matrix multiplication that relies on two main techniques. The first is a method to force all the numbers within the matrices to be ternary, meaning they can take one of three values: negative one, zero, or positive one. This allows the computation to be reduced to summing numbers rather than multiplying.
From a computer science perspective the two algorithms can be coded the exact same way, but the method from Eshraghian’s team eliminates a ton of cost on the hardware side.
“From a circuit designer standpoint, you don't need the overhead of multiplication, which carries a whole heap of cost,” Eshraghian said.
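To make the ternary idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the researchers' code: the function names `ternary_matvec` and `dense_matvec` and the toy values are hypothetical. It shows that when every weight is -1, 0, or +1, a matrix-vector product can be computed with additions and subtractions alone, and that the result matches ordinary multiplication.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix by a vector using only adds and subtracts.

    Assumes every entry of `weights` is -1, 0, or +1 (hypothetical helper,
    written for illustration only).
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the input
            elif w == -1:
                acc -= xi      # -1 weight: subtract the input
            # 0 weight: the input is skipped entirely
        out.append(acc)
    return out


def dense_matvec(weights, x):
    """Reference version that uses ordinary multiplication."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Toy example values, chosen only for illustration.
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]

result = ternary_matvec(W, x)
reference = dense_matvec(W, x)
print(result, reference)
```

In software the two functions look almost the same, which echoes the point above; the saving would come from hardware that no longer needs multiplier circuits for these weights.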
This strategy was inspired by a paper from Microsoft that showed it was possible to use ternary numbers in neural networks, but that work did not go as far as getting rid of matrix multiplication or open-sourcing the model to the public. To remove matrix multiplication altogether, the UC Santa Cruz researchers adjusted how the matrices communicate with each other.
Instead of multiplying every single number in one matrix with every single number in the other matrix, as is typical, the researchers devised a strategy to produce the same mathematical results. In this approach, the matrices are overlaid and only the most important operations are performed.
“It’s quite light compared to matrix multiplication,” said Rui-Jie Zhu, the paper’s first author and a graduate student in Eshraghian’s group. “We replaced the expensive operation with cheaper operations.”
Although they reduced the number of operations, the researchers were able to maintain the performance of the neural network by introducing time-based computation in the training of the model. This enables the network to have a “memory” of the important information it processes, enhancing performance. This technique paid off — the researchers compared their model to Meta’s state-of-the-art algorithm called Llama, and were able to achieve the same performance, even at a scale of billions of model parameters.
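The article does not spell out the exact update rule, so the following Python sketch is only an assumption about what "overlaying" matrices and time-based computation could look like in general: an element-wise gated recurrence. The function names `gated_step` and `run_sequence`, and the gate and candidate values, are hypothetical. Each position in the state is paired only with the matching position in the new input, and the state carried from step to step acts as the "memory."

```python
def gated_step(h_prev, candidate, forget):
    """One element-wise update of a running state.

    Each position i is handled on its own: forget[i] decides how much of
    the old state to keep versus how much of the new candidate to take.
    No row-by-column matrix product is involved (illustrative assumption).
    """
    return [f * h + (1.0 - f) * c
            for h, c, f in zip(h_prev, candidate, forget)]


def run_sequence(candidates, forgets, size):
    """Carry the state across time steps, starting from zeros."""
    h = [0.0] * size
    for c, f in zip(candidates, forgets):
        h = gated_step(h, c, f)
    return h


# Two time steps of toy data (hypothetical values).
candidates = [[1.0, 2.0], [3.0, 4.0]]
forgets = [[0.0, 0.0], [0.5, 1.0]]

final_state = run_sequence(candidates, forgets, size=2)
print(final_state)
```

In the second step, the first position blends old and new values equally, while the second position keeps its old value untouched, which is the sense in which the state remembers earlier inputs.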
Custom chips
The researchers designed their neural network to operate on GPUs, as they have become ubiquitous in the AI industry, allowing the team’s software to be readily accessible and useful to anyone who might want to use it.
On standard GPUs, the researchers saw that their neural network achieved about 10 times less memory consumption and operated about 25 percent faster than other models. Reducing the amount of memory needed to run a powerful large language model could provide a path forward to enabling the algorithms to run at full capacity on devices with smaller memory like smartphones.
Nvidia, the dominant producer of GPUs worldwide, designs its hardware to be highly optimized for matrix multiplication, which has enabled the company to dominate the industry and become one of the most profitable companies in the world. However, this hardware is not fully optimized for ternary operations.
To push the energy savings even further, the team collaborated with Assistant Professor Dustin Richmond and Lecturer Ethan Sifferman in the Baskin Engineering Computer Science and Engineering department to create custom hardware. Over three weeks, the team created a prototype of their hardware on a highly customizable circuit called a field-programmable gate array (FPGA). This hardware enables them to take full advantage of all the energy-saving features they programmed into the neural network.
With this custom hardware, the model surpasses human-readable throughput, meaning it produces words faster than the rate a human reads, on just 13 watts of power. Using GPUs would require about 700 watts of power, meaning that the custom hardware achieved more than 50 times the efficiency of GPUs.
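The "more than 50 times" figure follows from the two power numbers quoted above. A quick check in Python, with variable names chosen here for illustration, and assuming both setups deliver comparable throughput so that the power ratio stands in for the efficiency ratio:

```python
# Power figures quoted in the article, in watts.
fpga_watts = 13
gpu_watts = 700

# Ratio of GPU power draw to custom-hardware power draw.
ratio = gpu_watts / fpga_watts
print(f"{ratio:.1f}x")
```

The ratio comes out to roughly 54, consistent with the "more than 50 times" claim.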
With more development, the researchers believe they can optimize the technology for even greater energy efficiency.
“These numbers are already really solid, but it is very easy to make them much better,” Eshraghian said. “If we’re able to do this within 13 watts, just imagine what we could do with a whole data center worth of compute power. We’ve got all these resources, but let’s use them effectively.”