Publications / Research Outputs
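Research Achievements
Journal papers, conference presentations, and theses are organized by type.
Each entry provides access to its published abstract, bibliographic information, related materials, and PDF.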
1
International Journals
Peer-reviewed papers on edge AI, agricultural image recognition, and CGLA acceleration.
01.
"Lightweight YOLOX-based Green Onion Branching Point Detection for Automated Peeling on Edge Device"
Accepted
Impact Factor: 0.72 (2026)
@article{ando2026lightweight,
author = {Takuto Ando and Iori Yamaguchi and Jun Shono and Takahiro Kawabe and Koji Uchida and Kosuke Shigematsu and Yusuke Inoue},
title = {Lightweight YOLOX-based Green Onion Branching Point Detection for Automated Peeling on Edge Device},
journal = {ICIC Express Letters},
year = {2026},
note = {accepted}
}
02.
"Efficient Kernel Mapping and Comprehensive System Evaluation of LLM Acceleration on a CGLA"
Impact Factor: 3.6 (2026)
[en]
Large Language Models (LLMs) demand substantial computational resources, resulting in high energy consumption on GPUs. To address this challenge, we focus on Coarse-Grained Reconfigurable Arrays (CGRAs) as an effective alternative that provides a trade-off between energy efficiency and programmability. This paper presents the first comprehensive, end-to-end evaluation of a non-AI-specialized Coarse-Grained Linear Array (CGLA) accelerator for the state-of-the-art Qwen3 LLM family. The architecture has a general-purpose, task-agnostic design, yet its flexible instruction set allows for domain-specific adaptations. This flexibility enables the architecture to achieve high efficiency for sustainable LLM inference. We assess the performance of our architecture on an FPGA prototype using the widely adopted llama.cpp framework. We then project its potential as a 28 nm ASIC and compare it against a high-performance GPU (NVIDIA RTX 4090) and an edge AI device (NVIDIA Jetson AGX Orin). While GPUs exhibit lower latency, our non-AI-specific accelerator achieves higher energy efficiency, improving the Power-Delay Product (PDP) by up to 44.4× and 13.6× compared with the RTX 4090 and Jetson, respectively. Similarly, it reduces the Energy-Delay Product (EDP) by up to 11.5× compared to the high-performance GPU, demonstrating a favorable performance-energy trade-off. Critically, our system-level analysis identifies host-accelerator data transfer as the primary performance bottleneck, a factor often overlooked in kernel-level studies. These findings provide design guidance for next-generation LLM accelerators. This work validates CGRAs as a suitable platform for LLM inference in power-constrained environments, without being confined to specific algorithms.
@article{ando2025efficient,
author = {Takuto Ando and Yu Eto and Ayumu Takeuchi and Yasuhiko Nakashima},
title = {Efficient Kernel Mapping and Comprehensive System Evaluation of LLM Acceleration on a CGLA},
journal = {IEEE Access},
volume = {13},
pages = {199631--199646},
year = {2025},
doi = {10.1109/ACCESS.2025.3636266}
}
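The abstract above compares devices by Power-Delay Product (PDP) and Energy-Delay Product (EDP). As a minimal sketch of how these two metrics relate, assuming the standard definitions (PDP = power × delay, EDP = PDP × delay), the following computes both for two devices. The `Measurement` type, the helper names, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One device's measurement for a fixed workload (values are hypothetical)."""
    name: str
    power_w: float   # average power draw in watts
    delay_s: float   # end-to-end latency in seconds


def pdp(m: Measurement) -> float:
    """Power-Delay Product in joules: power x delay (energy per run)."""
    return m.power_w * m.delay_s


def edp(m: Measurement) -> float:
    """Energy-Delay Product in joule-seconds: energy x delay."""
    return pdp(m) * m.delay_s


def improvement(baseline: float, candidate: float) -> float:
    """How many times smaller the candidate metric is than the baseline."""
    return baseline / candidate


# Hypothetical numbers chosen only to make the arithmetic easy to follow.
gpu = Measurement("gpu", power_w=400.0, delay_s=1.0)
accel = Measurement("accelerator", power_w=5.0, delay_s=4.0)

pdp_gain = improvement(pdp(gpu), pdp(accel))   # 400 / 20 = 20.0
edp_gain = improvement(edp(gpu), edp(accel))   # 400 / 80 = 5.0
print(f"PDP improvement: {pdp_gain:.1f}x, EDP improvement: {edp_gain:.1f}x")
```

In this toy case the EDP gain is smaller than the PDP gain because EDP counts delay twice, so a slower but lower-power device is penalized more. This matches the general pattern in the abstract, where the GPU has lower latency and the reported EDP improvement is smaller than the PDP improvement.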
03.
"DPU-Based Hardware Implementation for Real-time Facial Expression Recognition System"
Impact Factor: 0.72 (2026)
[en]
In this paper, we implemented a standalone DPU-based facial expression recognition system on an SoC FPGA. The system consists of a face detection step and a facial expression recognition step. In conventional FPGA-based facial expression recognition systems, the Haar Cascade detector is run on the CPU in the face detection step due to FPGA (Field Programmable Gate Array) resource limitations. However, the Haar Cascade detector is less accurate than DNN (Deep Neural Network)-based face detection for images of profile faces and images with changing lighting conditions. On the other hand, face detection using a DNN such as YOLO incurs long latency when executed on a CPU with low computing performance. Therefore, we offload DNN-based face detection and facial expression recognition to a DPU, a CNN accelerator on the FPGA, to speed up processing. In this work, we combined face detection with DenseBox and CNN-based facial expression recognition on the same DPU. The same DPU was used to implement the facial expression recognition system, which enabled efficient use of FPGA resources while keeping the size of the circuitry
@article{ando2025dpu,
author = {Takuto Ando and Yusuke Inoue},
title = {DPU-Based Hardware Implementation for Real-time Facial Expression Recognition System},
journal = {ICIC Express Letters},
volume = {19},
number = {4},
pages = {419--426},
year = {2025},
doi = {10.24507/icicel.19.04.419}
}
2
Domestic Journals
Peer-reviewed papers written in Japanese that include applied evaluation.
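01.
"Real-Time Detection of Green Onion Branching Point Position on Edge Devices" (original title: エッジデバイス上におけるリアルタイム小ねぎ分岐部位置検出)
[ja]
In recent years, the shortage of agricultural workers in green onion production has become a serious problem, creating demand for a green onion peeling machine that reduces manual labor. Current machines cannot remove all unnecessary leaves in a single pass, so the remaining leaves must be removed by hand after the onions have passed through the machine. To peel without this secondary processing, it is known to be effective to feed each onion into the machine with the nozzle aligned to its topmost branching point. Because the branching point position differs from plant to plant, it must be identified and aligned before feeding. We therefore reasoned that detecting the branching point by image recognition and aligning the nozzle automatically would lead to accurate peeling. In this paper, we propose a method that extracts the diagonal lines characteristic of the green onion branching point in order to detect it. Targeting implementation on low-power edge devices with limited computing capability, the method uses a lightweight algorithm based on image processing with edge detection. Applied to real green onion images and evaluated on a Raspberry Pi 3, the method achieved a detection rate of 90.6% for branching-point diagonal lines with a processing time of 455 ms. These results show that the method is effective for branching point detection and indicate that it can potentially be applied in real environments.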
@article{ando2024greenonion,
author = {安藤 拓翔 and 井上 優良},
title = {エッジデバイス上におけるリアルタイム小ねぎ分岐部位置検出},
journal = {農業情報研究},
volume = {33},
number = {2},
pages = {73--80},
year = {2024},
doi = {10.3173/air.33.73}
}
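The abstract above describes extracting the diagonal lines of the branching point with a lightweight, edge-detection-based algorithm. The paper's actual algorithm is not given on this page, so the following is only an illustrative sketch of one possible approach: a Sobel gradient filter that keeps strong edge pixels whose gradient orientation is close to 45 or 135 degrees. The function name, thresholds, and synthetic test images are all assumptions made for this example.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def diagonal_edge_mask(image, magnitude_threshold=200.0, angle_tolerance_deg=15.0):
    """Return a 0/1 mask of strong edges whose gradient is near 45 or 135 degrees.

    `image` is a 2D list of grayscale values. Border pixels are left at 0.
    Both thresholds are hypothetical defaults, not values from the paper.
    """
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            if math.hypot(gx, gy) < magnitude_threshold:
                continue  # flat region or weak edge
            # Fold the gradient direction into [0, 180) so opposite signs match.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            if min(abs(angle - 45.0), abs(angle - 135.0)) <= angle_tolerance_deg:
                mask[y][x] = 1
    return mask


# Synthetic 8x8 images: one diagonal boundary and one vertical boundary.
size = 8
diagonal = [[255 if x > y else 0 for x in range(size)] for y in range(size)]
vertical = [[255 if x >= size // 2 else 0 for x in range(size)] for y in range(size)]

diagonal_hits = sum(map(sum, diagonal_edge_mask(diagonal)))
vertical_hits = sum(map(sum, diagonal_edge_mask(vertical)))
print(diagonal_hits, vertical_hits)
```

The diagonal boundary produces marked pixels, while the vertical boundary produces none because its gradient direction is horizontal. A pure-Python loop like this is meant to show the idea only; a real edge-device implementation would use an optimized image processing library.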
3
International Conferences
Presentations at international conferences and forums on CGLA, IMAX, and edge AI systems.
01.
"Q-snap: Quantization-aware dynamic chunking for LLM execution on a CGLA"
Best Paper Award
02.
"Implementation and Evaluation of Stable Diffusion on a General-Purpose CGLA Accelerator"
[en]
This paper presents the first implementation and in-depth evaluation of the primary computational kernels from the stable-diffusion.cpp image generation framework on IMAX3, a general-purpose Coarse-Grained Reconfigurable Array (CGRA) accelerator. We designed IMAX3 as a versatile computational platform, and this work assesses its capabilities by executing a demanding image generation workload. We evaluate its performance on a current Field-Programmable Gate Array (FPGA) prototype to establish a baseline and project its potential for a future Application-Specific Integrated Circuit (ASIC) implementation. Our results demonstrate that, despite its general-purpose architecture, IMAX3 achieves promising performance and power efficiency, particularly in its projected ASIC form. This work provides concrete guidelines for future IMAX architectural designs and establishes a foundation for developing next-generation, AI-specialized Coarse-Grained Linear Array (CGLA) accelerators by refining this versatile platform. Ultimately, this achievement contributes to the realization of energy-efficient, on-device, multi-modal AI platforms.
@inproceedings{ando2025stablediffusion,
author = {Takuto Ando and Yu Eto and Yasuhiko Nakashima},
title = {Implementation and Evaluation of Stable Diffusion on a General-Purpose CGLA Accelerator},
booktitle = {2025 IEEE 18th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)},
year = {2025},
doi = {10.1109/MCSoC67473.2025.00120}
}
03.
"Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA"
Best Paper Award
[en]
The rise of generative AI for tasks like Automatic Speech Recognition (ASR) has created a critical energy consumption challenge. While ASICs offer high efficiency, they lack the programmability to adapt to evolving algorithms. To address this trade-off, we implement and evaluate Whisper's core computational kernel on IMAX, a general-purpose Coarse-Grained Linear Array (CGLA) accelerator. To our knowledge, this is the first work to execute a Whisper kernel on a CGRA and compare its performance against CPUs and GPUs. Using hardware/software co-design, we evaluate our system via an FPGA prototype and project performance for a 28 nm ASIC. Our results demonstrate superior energy efficiency. The projected ASIC is 1.90x more energy-efficient than the NVIDIA Jetson AGX Orin and 9.83x more energy-efficient than an NVIDIA RTX 4090 for the Q8_0 model. This work positions CGLA as a promising platform for sustainable ASR on power-constrained edge devices.
@inproceedings{ando2025whisper,
author = {Takuto Ando and Yu Eto and Ayumu Takeuchi and Yasuhiko Nakashima},
title = {Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA},
booktitle = {2025 13th International Symposium on Computing and Networking (CANDAR)},
year = {2025},
doi = {10.1109/CANDAR68384.2025.00018}
}
04.
"Energy-Efficient Llama3 Acceleration on a CGLA by Offloading Computational Kernels"
[en]
The evolution of Large Language Models (LLMs) has led to a substantial increase in energy consumption, making energy-efficient accelerator technologies essential. We propose a method to maximize the performance of the Llama3 LLM on IMAX3, a CGRA-based accelerator. We conducted a detailed analysis of the Q3_K_S quantization method. Based on this analysis, we implemented a new custom kernel to offload the Q6_K operations to the hardware. In previous work, this operation was processed on the host CPU. This targeted optimization reduced the end-to-end latency by approximately 11%. Furthermore, an evaluation of a projected ASIC implementation showed that the absolute performance does not match that of a GPU. In contrast, our approach achieves superior power efficiency, as measured by the Power-Delay Product (PDP).
@inproceedings{ando2025llama3forum,
author = {Takuto Ando and Yu Eto and Ayumu Takeuchi and Yasuhiko Nakashima},
title = {Energy-Efficient Llama3 Acceleration on a CGLA by Offloading Computational Kernels},
booktitle = {2025 IEEE 38th International System-on-Chip Conference (SOCC) PhD/Master Forum},
year = {2025},
doi = {10.1109/SOCC66126.2025.11235407}
}
05.
"LLM Performance Bottlenecks on CGLA"
06.
"Energy-Efficient FlashAttention Acceleration on CGLA"
07.
"Performance Evaluation of Flan-T5 in CGLA Based Accelerators"
08.
"A Detailed Analysis of LLM Execution on IMAX3 and Initial Evaluation of IMAX4 Prototype for Server Environment"
Young Researcher Award
[en]
In this paper, we present a detailed analysis of the IMAX3 CGRA-based accelerator and the IMAX4 prototype with an upgraded host CPU. To address the host CPU bottlenecks of its predecessor, IMAX3, IMAX4 incorporates a server-oriented Intel Xeon processor and PCIe Gen5 connectivity, realizing IMAX scalability. We implemented and evaluated the IMAX4 prototype using microbenchmarks and the LLaMA3 8B quantized model. The results demonstrate significantly reduced host-side overheads and improved data transfer compared to IMAX3. While performance characteristics vary by quantization, IMAX4 demonstrates significant potential by shifting bottlenecks from the host to data pathways, highlighting the viability of CGRA for server-based LLM acceleration.
09.
"Facial Expression Recognition System Using DNN Accelerator with Multi-threading on FPGA"
[en]
In this paper, we implement a stand-alone facial expression recognition system on an SoC FPGA with multi-threading using a Deep learning Processor Unit (DPU). The system consists of two steps: one for face detection and one for facial expression recognition. In the previous work, the Haar Cascade detector was run on a CPU in the face detection step due to FPGA resource limitations, but this detector is less accurate for profile faces and images with variable illumination. Moreover, the previous work used a dedicated circuit accelerator, so running a second DNN inference for face detection on the FPGA would require adding a new accelerator. As an alternative to this approach, we run both DNN inferences on a DPU, which is a general-purpose CNN accelerator of the systolic array type. Our method for face detection using DenseBox and facial expression recognition using a CNN on the same DPU enables the efficient use of FPGA resources while maintaining a small circuit size. We also developed a multi-threading technique that improves the overall throughput while increasing the DPU utilization efficiency. With this approach, we achieved an overall system throughput of 25 FPS and a 2.4 times higher throughput per unit of power consumption.
@inproceedings{ando2024multithreading,
author = {Takuto Ando and Yusuke Inoue},
title = {Facial Expression Recognition System Using DNN Accelerator with Multi-threading on FPGA},
booktitle = {2024 12th International Symposium on Computing and Networking Workshops (CANDARW)},
pages = {103--109},
year = {2024},
doi = {10.1109/CANDARW64572.2024.00025}
}
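The abstract above mentions a multi-threading technique that raises overall throughput when face detection and expression recognition share one accelerator. How the paper schedules the two models is not described on this page, so the following is only a structural sketch of a generic two-stage pipeline: each stage runs in its own thread and the stages hand work over through a bounded queue. The functions `detect_faces` and `recognize_expression` are hypothetical placeholders and do not call any real DPU API.

```python
import queue
import threading

SENTINEL = None  # marks the end of the frame stream


def detect_faces(frame):
    """Placeholder for the face detection inference (hypothetical)."""
    return {"frame": frame, "faces": [f"face-of-{frame}"]}


def recognize_expression(detection):
    """Placeholder for the expression recognition inference (hypothetical)."""
    return (detection["frame"], [f"expr-for-{face}" for face in detection["faces"]])


def run_pipeline(frames, queue_size=4):
    """Run detection and recognition as two threads joined by a bounded queue."""
    handoff = queue.Queue(maxsize=queue_size)
    results = []

    def detection_stage():
        for frame in frames:
            handoff.put(detect_faces(frame))  # blocks when the queue is full
        handoff.put(SENTINEL)

    def recognition_stage():
        while True:
            item = handoff.get()
            if item is SENTINEL:
                break
            results.append(recognize_expression(item))

    workers = [threading.Thread(target=detection_stage),
               threading.Thread(target=recognition_stage)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


results = run_pipeline(["f0", "f1", "f2"])
print(results)
```

With one producer and one consumer on a FIFO queue, results come out in frame order. The benefit of such a layout in a real system would come from overlapping host-side pre- and post-processing for one frame with accelerator execution for another; this sketch only shows the thread and queue structure.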
10.
"FPGA Implementation of a DPU-Based Facial Expression Recognition System"
[en]
In this paper, we implemented a standalone DPU-based facial expression recognition system on SoC FPGA. In conventional FPGA-based systems, the Haar Cascade detector is run on the CPU for face detection due to FPGA resource limitations. We offload face detection and facial expression recognition by DNN to DPU, a CNN accelerator on FPGA. The same DPU was used to implement the facial expression recognition system, which enabled efficient use of FPGA resources while minimizing the size of the circuitry.
4
Domestic Workshops
Presentations at domestic workshops and conferences covering implementation details and system evaluation.
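01.
"Implementation and Future Prospects of a Unified Software Stack for Diverse AI Workloads on CGLA" (original title: CGLA上の多様なAIワークロードに向けた統合ソフトウェアスタックの実装と今後の展望)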
02.
"Scalability Evaluation of a CGLA with 3D-Stacked DRAM for LLMs" (original title: 3D積層DRAMを用いたCGLAのLLM向けスケーラビリティ評価)
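03.
"Improving Power Efficiency in a Facial Expression Recognition System Using a DNN Accelerator" (original title: DNNアクセラレータを用いた表情認識システムにおける電力効率の向上)
Student Encouragement Award
[ja]
In this study, we implement a facial expression recognition system using a DPU on an SoC FPGA. In a previous FPGA-based facial expression recognition system, face detection was run on the CPU using a Cascade detector because of FPGA resource constraints. However, this approach is less accurate than DNN-based face detection. We therefore propose a hardware configuration that runs on an FPGA with limited resources by executing DNN-based face detection and facial expression recognition on the same DPU in a time-division manner. We also propose a multi-threading method to improve overall power consumption and throughput. The system achieves approximately 2.4 times the throughput per unit of power consumption of the previous work.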
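04.
"Instance Segmentation for Detecting the Green Onion Peeling Position" (original title: 小ねぎ調製位置検出のためのインスタンスセグメンテーション)
05.
"Implementation of a Facial Expression Recognition System Using a DPU on FPGA" (original title: FPGAにおけるDPUを用いた表情認識システムの実装)
06.
"Implementation of a Real-Time Facial Expression Recognition System on FPGA" (original title: FPGAによるリアルタイム表情認識システムの実装)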
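07.
"Detection of Green Onion Branching Points Using Edge Detection" (original title: エッジ検出を用いたこねぎ分岐部の検出)
[ja]
In recent years, employment conditions in green onion production have been changing, for example through rising minimum wages, and there is demand for a green onion peeling machine that needs no manual labor in order to hold down labor costs. Current machines cannot remove all unnecessary leaves in a single pass, and the secondary processing requires many workers. For accurate peeling, it is known to be effective to align the peeling nozzle with the branching point at the top of the outer leaves. Therefore, if the branching point position can be detected and the nozzle aligned automatically, a peeling machine that needs no secondary processing can be developed. In this paper, we propose a method that extracts the diagonal lines characteristic of the green onion branching point in order to detect it. The method achieves robust detection by combining diagonal line extraction based on edge detection with a labeling process. Applied to real green onion images, the method achieved a detection rate of 92% for branching-point diagonal lines and was able to detect the branching point position.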
5
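Theses
The graduation thesis from the regular course at a college of technology (KOSEN) and the special research thesis from the advanced course.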