Computational imaging pipelines (CIPs) convert the raw output of imaging sensors into the high-quality images that are used for further processing. This work studies how Block-Matching and 3D filtering (BM3D), a state-of-the-art denoising algorithm, can be implemented to meet the demands of user-interactive (UI) applications. Denoising is the most computationally demanding stage of a CIP, taking more than 95% of the execution time in a highly optimized software implementation. We analyze the performance and energy consumption of optimized software implementations on three commodity platforms and find that their performance is inadequate. Accordingly, we consider two alternatives: a dedicated accelerator, and running recently proposed Neural Network (NN) based approximations of BM3D [9, 27] on an NN accelerator. We develop Image DEnoising AcceLerator (IDEAL), a hardware BM3D accelerator that incorporates the following techniques: 1) a novel software-hardware optimization, Matches Reuse (MR), that exploits typical image content to reduce the computations needed by BM3D, 2) prefetching and judicious use of on-chip buffering to minimize execution stalls and off-chip bandwidth consumption, 3) a careful arrangement of specialized computing blocks, and 4) data type precision tuning. Over a dataset of images with resolutions ranging from 8 megapixels (MP) to 42 MP, IDEAL is 11,352× and 591× faster than high-end general-purpose processor (CPU) and graphics processor (GPU) software implementations, respectively, with orders of magnitude better energy efficiency. Even when the NN approximations of BM3D are run on the DaDianNao high-end hardware NN accelerator, IDEAL is 5.4× faster and 3.95× more energy efficient.
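For readers unfamiliar with BM3D, the block-matching step that gives the algorithm its name can be sketched as follows. This is a minimal illustration of generic block matching only, not the IDEAL design and not the Matches Reuse optimization. The function names (`patch_distance`, `block_match`) and all parameter values (patch size, search radius, threshold, match limit) are assumptions chosen for the example.

```python
def patch_distance(img, r0, c0, r1, c1, size):
    """Sum of squared differences between two size x size patches."""
    total = 0.0
    for dr in range(size):
        for dc in range(size):
            d = img[r0 + dr][c0 + dc] - img[r1 + dr][c1 + dc]
            total += d * d
    return total


def block_match(img, ref_r, ref_c, size=4, search=8,
                max_matches=16, threshold=2500.0):
    """Find patches similar to the reference patch at (ref_r, ref_c).

    Scans a (2*search+1)^2 window around the reference, keeps patches
    whose distance is within `threshold`, and returns the top-left
    corners of up to `max_matches` of them, closest first.
    All default values are illustrative assumptions.
    """
    rows, cols = len(img), len(img[0])
    # Clamp the search window so every candidate patch fits in the image.
    r_lo = max(0, ref_r - search)
    r_hi = min(rows - size, ref_r + search)
    c_lo = max(0, ref_c - search)
    c_hi = min(cols - size, ref_c + search)

    candidates = []
    for r in range(r_lo, r_hi + 1):
        for c in range(c_lo, c_hi + 1):
            d = patch_distance(img, ref_r, ref_c, r, c, size)
            if d <= threshold:
                candidates.append((d, r, c))

    candidates.sort()  # closest patches first
    return [(r, c) for _, r, c in candidates[:max_matches]]
```

In BM3D, the matched patches are stacked into a 3D group that is then filtered jointly. Because this search runs for every reference patch, its cost grows with both the image size and the search window, which is consistent with denoising dominating the pipeline's runtime.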