TY - GEN
T1 - A 28nm 192.3TFLOPS/W Accurate/Approximate Dual-Mode-Transpose Digital 6T-SRAM CIM Macro for Floating-Point Edge Training and Inference
AU - Yuan, Yiyang
AU - Zhang, Bingxin
AU - Yang, Yiming
AU - Luo, Yishan
AU - Chen, Qirui
AU - Lv, Shidong
AU - Wu, Hao
AU - Ma, Cailian
AU - Li, Ming
AU - Yue, Jinshan
AU - Wang, Xinghua
AU - Xing, Guozhong
AU - Mak, Pui In
AU - Li, Xiaoran
AU - Zhang, Feng
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - SRAM CIM macros have been developed to enhance energy efficiency (EF) in edge-AI applications. However, most research has predominantly focused on inference [1-6], with relatively little exploration of the training process. Fine-tuning during training is critical for improving the accuracy of neural-network models, which directly impacts the user experience. Unlike remote- or cloud-based training, on-device training offers advantages such as real-time response, power saving, and user-privacy protection. The training process primarily involves two phases: feed-forward (FF) and backpropagation (BP). The FF phase resembles the inference process, while the BP phase requires multiplying the error gradient by the transposed weight matrix. Although several studies have introduced transpose CIM (T-CIM) to support both FF and BP [7-9], several challenges remain: (1) previous work uses separate circuits for FF and BP, diminishing area and energy efficiency due to the lack of multiply-accumulate (MAC) circuit reuse; (2) prior work is limited to integer (INT) formats, and research indicates that INT representation can significantly reduce training accuracy due to its lower resolution [10]. Pre-aligned FP-CIM schemes have been developed [1,11,12], but these still suffer from accuracy losses caused by mantissa truncation during the pre-alignment process; (3) reliance on analog-CIM schemes leads to accuracy losses from process, voltage, and temperature (PVT) variations, which further degrade training accuracy. Although digital CIMs (DCIMs) can mitigate these issues, optimizing the tradeoffs between SRAM arrays and MAC circuits to simultaneously achieve high memory density (MD) and area efficiency (AF) remains challenging [1]. Increasing the number of SRAM sub-bank rows and employing bit-parallel techniques can improve MD and AF [2], while approximate computing can enhance access speed and EF [3,4], as illustrated in Fig. 14.5.1.
AB - SRAM CIM macros have been developed to enhance energy efficiency (EF) in edge-AI applications. However, most research has predominantly focused on inference [1-6], with relatively little exploration of the training process. Fine-tuning during training is critical for improving the accuracy of neural-network models, which directly impacts the user experience. Unlike remote- or cloud-based training, on-device training offers advantages such as real-time response, power saving, and user-privacy protection. The training process primarily involves two phases: feed-forward (FF) and backpropagation (BP). The FF phase resembles the inference process, while the BP phase requires multiplying the error gradient by the transposed weight matrix. Although several studies have introduced transpose CIM (T-CIM) to support both FF and BP [7-9], several challenges remain: (1) previous work uses separate circuits for FF and BP, diminishing area and energy efficiency due to the lack of multiply-accumulate (MAC) circuit reuse; (2) prior work is limited to integer (INT) formats, and research indicates that INT representation can significantly reduce training accuracy due to its lower resolution [10]. Pre-aligned FP-CIM schemes have been developed [1,11,12], but these still suffer from accuracy losses caused by mantissa truncation during the pre-alignment process; (3) reliance on analog-CIM schemes leads to accuracy losses from process, voltage, and temperature (PVT) variations, which further degrade training accuracy. Although digital CIMs (DCIMs) can mitigate these issues, optimizing the tradeoffs between SRAM arrays and MAC circuits to simultaneously achieve high memory density (MD) and area efficiency (AF) remains challenging [1]. Increasing the number of SRAM sub-bank rows and employing bit-parallel techniques can improve MD and AF [2], while approximate computing can enhance access speed and EF [3,4], as illustrated in Fig. 14.5.1.
UR - http://www.scopus.com/pages/publications/105000823391
U2 - 10.1109/ISSCC49661.2025.10904659
DO - 10.1109/ISSCC49661.2025.10904659
M3 - Conference contribution
AN - SCOPUS:105000823391
T3 - Digest of Technical Papers - IEEE International Solid-State Circuits Conference
SP - 258
EP - 260
BT - 2025 IEEE International Solid-State Circuits Conference, ISSCC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 72nd IEEE International Solid-State Circuits Conference, ISSCC 2025
Y2 - 16 February 2025 through 20 February 2025
ER -