
The Hidden Power of "Cherry" Parameters in Large Language Models


Too Long; Didn't Read

Not all parameters in LLMs matter equally! This article explores parameter heterogeneity and how a small set of "cherry" parameters disproportionately affects model performance. Learn how preserving them in high precision during quantization can boost deployment efficiency.


Authors:

(1) Wanyun Cui, Shanghai University of Finance and Economics, with equal contribution;

(2) Qianle Wang, Shanghai University of Finance and Economics, with equal contribution.

Abstract and 1 Introduction

2 Related Work

3 Quantifying the Impact of Parameters on Model Performance & 4 Unified Mixed-Precision Training

5 Prevalence of Parameter Heterogeneity in LLMs

6 Quantization Experiments and 6.1 Implementation Details

6.2 Effect of Base LLM Quantization

6.3 Effect of Chat LLM Quantization

6.4 Comparison of Parameter Selection Criteria, Conclusion, & References

Abstract

This paper reveals the phenomenon of parameter heterogeneity in large language models (LLMs). We find that a small subset of "cherry" parameters exhibits a disproportionately large influence on model performance, while the vast majority of parameters have minimal impact. This heterogeneity is prevalent across different model families, scales, and types. Motivated by this observation, we propose CherryQ, a novel quantization method that unifies the optimization of mixed-precision parameters. CherryQ identifies and preserves the critical cherry parameters in high precision while aggressively quantizing the remaining parameters to low precision. Extensive experiments demonstrate the effectiveness of CherryQ: it outperforms existing quantization approaches in terms of perplexity and downstream task performance. Notably, our 3-bit quantized Vicuna-1.5 exhibits performance competitive with its 16-bit counterpart. These findings highlight the potential of CherryQ for enabling efficient deployment of LLMs by taking advantage of parameter heterogeneity.

1. Introduction

The rapid development of large language models (LLMs) has increased the demand for efficient deployment in diverse environments [1, 23, 12, 2]. However, their large parameter counts pose significant challenges for GPU memory. Quantization, which reduces the bit-width of model parameters, has emerged as a solution for alleviating the memory constraints of LLM deployment [14, 13, 24, 25, 18].


Quantizing parameters from a higher bit-width to the integer grid points of a lower-bit space inevitably perturbs the parameters away from their optimal values, leading to a degradation in performance (i.e., quantization error). To mitigate this error, various approaches have been proposed, such as iterative block-wise quantization [15, 10, 8], gradient approximation [18, 3], and low-rank approximation of perturbations [9]. However, existing approaches still cause clear performance degradation, especially at extremely low bit-widths (e.g., 3-bit).


Before investigating how to further mitigate this degradation, we raise a more fundamental question: to what extent can quantization errors be mitigated? Our study shows that the answer is more nuanced than expected. For the vast majority (> 99%) of parameters, the quantization errors are minimal and can be alleviated or safely ignored. However, there exists a small subset of parameters (< 1%) for which the quantization errors are substantial and hard to mitigate.


Consider Figure 1a as an example. It shows a scatter plot of the impact on quantization error when each individual parameter in a parameter matrix from LLaMA2-7B [23] is perturbed. The derivation of these impacts is detailed in § 3. While 99% of parameters have impacts in the range (0, 0.1), a small subset of "cherry" parameters exhibits a disproportionately large influence in the range (5, 30), 50-300 times greater than the maximum impact of the remaining 99% of parameters.
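
To make the notion of per-parameter impact concrete, here is a minimal sketch of one way such scores could be approximated, using a diagonal second-order proxy (squared gradient times squared quantization error). This is not the paper's actual criterion, which is derived in § 3; the function names and the 3-bit round-to-nearest grid are illustrative assumptions.

```python
import torch

# Hypothetical proxy for per-parameter impact: impact_i ~ g_i^2 * (w_i - Q(w_i))^2.
# Intended only to illustrate how a scatter plot like Figure 1 could be produced.

def nearest_grid_quantize(w: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Round-to-nearest uniform quantization with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def impact_scores(weight: torch.Tensor, grad: torch.Tensor, bits: int = 3) -> torch.Tensor:
    """Element-wise squared gradient times squared quantization error."""
    err = weight - nearest_grid_quantize(weight, bits)
    return grad.pow(2) * err.pow(2)
```

Computing `impact_scores(W, W.grad)` for a randomly sampled slice of a weight matrix and plotting the result is one way a scatter like Figure 1 could be visualized: a dense band of near-zero scores plus a sparse set of extreme outliers.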


This phenomenon is not an isolated occurrence. We observe similar patterns across different LLM scales (Figures 1a, 1b), different LLM families, including Mistral [12] (Figure 1c) and Gemma [22] (Figure 1d), and both base and chat models (Vicuna-1.5 [5], Figures 1e, 1f).


Figure 1: Scatter plot of parameter impacts in different LLMs. We randomly sample 4096 parameters from the corresponding parameter matrix. Each point represents the impact of an individual parameter. Insets show a zoomed-in y-axis. The heterogeneity is found across different model scales (1a, 1b), different model families (1c, 1d), and both base and chat models (1e, 1f).


The consistent presence of this pattern suggests that it is an inherent characteristic of LLMs. We elaborate on this evidence from a more macroscopic view in § 5. Based on these findings, we introduce the phenomenon of parameter heterogeneity:


Parameter Heterogeneity in LLMs: Considering the impact of parameter perturbations on the model's behavior, a small set of "cherry" parameters has a disproportionately large influence, while the vast majority of normal parameters have minimal impact.


We emphasize that parameter impact is heterogeneous because its imbalance is far more severe than the imbalances commonly observed in LLMs [18, 17, 6, 7]. For comparison, in § 5 we demonstrate that parameter impact is much more heterogeneous than parameter magnitude, even though the latter is also known to be imbalanced [7, 18]. This stark difference highlights the importance of considering parameter impact when optimizing and quantizing LLMs.
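
As a rough illustration of what "more heterogeneous" means, the sketch below compares how heavy-tailed two score distributions are by dividing the largest score by the 99th-percentile score. The statistic and function name are assumptions made here for illustration, not the analysis used in § 5.

```python
import torch

def heterogeneity_ratio(scores: torch.Tensor) -> float:
    """Largest score divided by the 99th-percentile score (larger = more heterogeneous)."""
    flat = scores.flatten().float().sort().values
    p99 = flat[int(0.99 * (flat.numel() - 1))]
    return (flat[-1] / p99).item()

# Illustrative comparison for one weight matrix W with impact scores `impact`
# (computed, e.g., as in the earlier sketch):
#   heterogeneity_ratio(impact)   # impact: a few outliers dominate by orders of magnitude
#   heterogeneity_ratio(W.abs())  # magnitude: imbalanced, but far less extreme
```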


Parameter heterogeneity poses new challenges for conventional quantization strategies. Although quantizing most normal parameters causes little performance degradation, the cherry parameters are highly sensitive to quantization-induced perturbations, and existing strategies usually fail to preserve the delicate structure of these critical parameters [8, 18].


Addressing this challenge is not trivial. A straightforward approach is mixed-precision quantization [13], representing cherry parameters in high precision and normal parameters in low precision. However, optimizing both types of parameters simultaneously becomes a major challenge. In the widely used GPTQ [8] approach, and in the post-training quantization (PTQ) framework more generally, the optimal values of parameters quantized early may shift as the cherry parameters are updated; yet once a parameter is quantized in PTQ it can no longer be updated, preventing the early-quantized parameters from reaching their optimal values.
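
The following sketch shows the basic mixed-precision partition in isolation: keep the top ~1% highest-impact weights in 16-bit and round the rest to a low-bit grid. The function name, the 1% default fraction, and the round-to-nearest grid are illustrative assumptions, not the paper's implementation.

```python
import torch

def mixed_precision_quantize(weight: torch.Tensor,
                             impact: torch.Tensor,
                             cherry_frac: float = 0.01,
                             bits: int = 3):
    """Return (mixed-precision weight, boolean mask of cherry positions)."""
    # Round-to-nearest low-bit grid with a single per-tensor scale.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max() / qmax
    w_q = torch.round(weight / scale).clamp(-qmax - 1, qmax) * scale

    # Mark the highest-impact ~cherry_frac of positions as "cherry" and keep them in 16-bit.
    k = max(1, int(cherry_frac * weight.numel()))
    threshold = impact.flatten().topk(k).values.min()
    cherry_mask = impact >= threshold
    return torch.where(cherry_mask, weight, w_q), cherry_mask
```

A static partition like this still leaves the optimization problem described above: under PTQ, the already-quantized normal weights cannot adjust once the cherry weights change, which is what motivates the training-based approach that follows.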


To address this challenge, we adopt the quantization-aware training (QAT) framework to handle parameter heterogeneity. Our method optimizes the mixed-precision parameters with a single, unified backpropagation pass. For the cherry parameters, we maintain 16-bit representations and apply standard gradient descent. For the normal parameters, we additionally apply the Straight-Through Estimator (STE) trick [3] so that gradient descent can proceed through the quantizer. As a result, all parameters are continuously and uniformly optimized. We denote this approach CherryQ (Cherry parameter and Quantization-aware training).
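
A minimal sketch of how such unified mixed-precision QAT could look in PyTorch is shown below: normal weights pass through a fake quantizer whose backward pass is the straight-through estimator, cherry weights bypass it, and one backward pass updates both. Class and method names are illustrative; this is not the authors' released implementation.

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Fake low-bit quantization in the forward pass, identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat quantization as identity so gradients
        # reach the underlying 16-bit weights.
        return grad_output, None

class CherryLinear(torch.nn.Module):
    """Linear layer whose cherry weights stay in 16-bit while normal weights are fake-quantized."""

    def __init__(self, weight: torch.Tensor, cherry_mask: torch.Tensor, bits: int = 3):
        super().__init__()
        self.weight = torch.nn.Parameter(weight.clone())   # [out_features, in_features]
        self.register_buffer("cherry_mask", cherry_mask)
        self.bits = bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = STEQuantize.apply(self.weight, self.bits)
        # One tensor, two precisions: cherry positions use the full-precision value,
        # normal positions use the quantized value. A single loss.backward() then
        # updates every element of self.weight.
        w = torch.where(self.cherry_mask, self.weight, w_q)
        return x @ w.t()
```

The sketch assumes the standard QAT pattern of retaining high-precision master weights during training and materializing low-bit values only in the forward pass.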


Extensive experiments on different models and benchmarks demonstrate the efficacy of CherryQ, which consistently yields the lowest perplexity in most settings. Notably, our 3-bit Vicuna-1.5 model exhibits performance on par with its 16-bit counterpart on Vicuna-bench [5].


We believe this work opens up new avenues for understanding and harnessing the complex interplay of parameters in LLMs. Parameter heterogeneity offers a novel perspective for navigating the trade-off between parameter efficiency and performance.


This paper is available on arXiv under a CC BY 4.0 DEED license.