Structured Pruning for Large Language Models Using Coupled Components Elimination and Minor Fine-tuning
Large language models (LLMs) have demonstrated powerful capabilities in natural language processing, yet their vast number of parameters poses challenges for deployment and inference efficiency. Structured model pruning emerges as a viable approach to reduce model size and accelerate inference without requiring specialized operators or libraries for deployment. However, structured pruning often severely weakens the model’s capability. Although repeated fine-tuning can restore that capability to a certain extent, it impairs LLMs’ utility as versatile problem solvers. To address this issue, we propose a novel structured pruning algorithm tailored for LLMs. It derives the importance of different components, namely rows and columns in parameter matrices, from intermediate data dependencies. It then removes coupled components across different layers simultaneously while preserving the dependency relationships within the remaining parameters, avoiding significant performance degradation. The pruned model requires only a few epochs of fine-tuning to restore its performance, preserving the model’s ability to generalize. Empirical evaluations on LLaMA, Vicuna, and ChatGLM3 demonstrate our algorithm’s efficacy, yielding a 20% parameter reduction while retaining at least 94.4% of the original performance.
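To make the coupled-components idea concrete, below is a minimal PyTorch sketch of how one might prune matched rows and columns of a single feed-forward block together, so that the dependency between the two projections stays intact. All names here (`prune_coupled_mlp`, `up_proj`, `down_proj`, `hidden_acts`) are illustrative, and the mean-absolute-activation importance score is a stand-in assumption; the paper derives importance from intermediate data dependencies rather than this simple proxy.

```python
import torch
import torch.nn as nn

def prune_coupled_mlp(up_proj: nn.Linear, down_proj: nn.Linear,
                      hidden_acts: torch.Tensor, keep_ratio: float = 0.8):
    """Prune the same intermediate channels from a coupled pair of
    projections (up: d_model -> d_ff, down: d_ff -> d_model).

    hidden_acts: sampled intermediate activations of shape
    (n_tokens, d_ff), used here as a stand-in importance signal.
    """
    d_ff = up_proj.out_features
    n_keep = max(1, int(keep_ratio * d_ff))

    # Importance of each intermediate channel; a simple proxy in this
    # sketch (the paper's criterion uses intermediate data dependencies).
    importance = hidden_acts.abs().mean(dim=0)            # shape (d_ff,)
    keep = torch.topk(importance, n_keep).indices.sort().values

    # Remove the coupled components together: rows of up_proj (its
    # output channels) and the matching columns of down_proj (its
    # input channels), preserving their dependency relationship.
    new_up = nn.Linear(up_proj.in_features, n_keep,
                       bias=up_proj.bias is not None)
    new_up.weight.data = up_proj.weight.data[keep, :].clone()
    if up_proj.bias is not None:
        new_up.bias.data = up_proj.bias.data[keep].clone()

    new_down = nn.Linear(n_keep, down_proj.out_features,
                         bias=down_proj.bias is not None)
    new_down.weight.data = down_proj.weight.data[:, keep].clone()
    if down_proj.bias is not None:
        new_down.bias.data = down_proj.bias.data.clone()  # output dim unchanged

    return new_up, new_down
```

Pruning the row and column indices jointly is what keeps the block functionally consistent after slicing; applying such a step independently per matrix would break the intermediate dimension that the two projections share.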