Generative AI has made notable progress in producing visually convincing images, but scientific imaging demands more than visual plausibility. In materials science and biology, images are judged by whether they capture real physical structure, not by how realistic they appear. A new study from Lawrence Berkeley National Laboratory asks whether today’s generative AI models are precise enough to be trusted for scientific imaging.
Published in the Journal of Imaging, the research evaluates several classes of generative AI models for their ability to reproduce high-resolution scientific images, including microCT scans of rock sediments, composite fibers, and plant roots. The work compares Variational Autoencoders, Generative Adversarial Networks, and diffusion-based models, examining how well each approach captures the underlying physical and biological features required for scientific analysis.
The Berkeley Lab team’s goal was to determine whether generative models could realistically fill gaps in experimental data, support large-scale benchmarking, or allow scientists to study processes that are too rare or expensive to capture through experiments. The researchers evaluated success by examining how well generated images preserved structural coherence and scientific accuracy when compared with real datasets.
This infographic compares how well different generative AI models create scientific images (Credit: Daniela Ushizima)
“When we generate images for science, we are not chasing aesthetics, we are chasing truth,” said Daniela Ushizima, a senior scientist in Berkeley Lab’s Applied Mathematics and Computational Research (AMCR) Division and the principal investigator of this research project, as quoted in a Berkeley Lab article. “While consumer tools can dazzle with style, scientific models have to encode the real physics and biology. When the science is right, these models can reveal patterns we have never seen before and accelerate discovery in ways experiments alone never could.”
To assess performance, the team employed a combination of quantitative metrics commonly used in image analysis, including the Structural Similarity Index Measure (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), Fréchet Inception Distance (FID), and CLIPScore. These measures evaluate factors such as structural consistency, perceptual similarity, and alignment between images and text descriptions. The researchers paired these metrics with expert review by domain scientists in materials science and biology, noting that standard quantitative measures can sometimes miss scientific inaccuracies.
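As a rough illustration of what this kind of metric-based screening involves (and not the authors' actual code), the Python sketch below computes SSIM, LPIPS, and FID for batches of real and generated images using the torchmetrics library. The tensor shapes, batch size, and random data are placeholder assumptions, and torchmetrics with its image extras is assumed to be installed.

```python
# Illustrative sketch of SSIM / LPIPS / FID evaluation with torchmetrics.
# Not the study's code: shapes, batch size, and data are placeholders.
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance

# Hypothetical batches of real and generated images: (N, 3, H, W), values in [0, 1]
real = torch.rand(64, 3, 128, 128)
fake = torch.rand(64, 3, 128, 128)

# SSIM: structural consistency between paired real and generated images
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
print("SSIM:", ssim(fake, real).item())

# LPIPS: perceptual distance measured in a pretrained network's feature space
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
print("LPIPS:", lpips(fake, real).item())

# FID: distance between the feature distributions of real and generated image sets
fid = FrechetInceptionDistance(feature=64, normalize=True)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())
```

In practice, scores like these would be computed over the study's microCT and imaging datasets and then weighed against the domain experts' qualitative review.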
“Traditional quantitative metrics can tell us if an AI-generated image looks realistic or matches certain patterns in real data, but they can’t always detect subtle errors that make an image scientifically inaccurate. That’s why expert validation will remain essential for ensuring that AI-generated images meet the rigorous standards of scientific research,” said Ushizima.
The study found that different model architectures have different strengths and weaknesses. GAN-based approaches, including NVIDIA’s StyleGAN, often produced visually coherent images that maintained important structural details seen in real data. Diffusion models generated very realistic images, but that realism did not always translate into scientific accuracy. The researchers also tested the creative platforms RunwayML and DeepAI but concluded that they do not reliably meet the standards required for scientific research.
High-performance computing was critical to carrying out the analysis, and the work is an example of how AI-for-science workflows increasingly depend on leadership-class HPC resources. The team used the National Energy Research Scientific Computing Center’s Perlmutter supercomputer both to train models from scratch and to evaluate pretrained models at scale. GPU acceleration allowed the researchers to process large imaging datasets and perform extensive comparisons that would be impractical on conventional systems.
Training GAN-based models was the most computationally intensive task, with StyleGAN requiring four NVIDIA A100 GPUs and several hours per iteration to train on high-resolution 512×512 image datasets. Simpler GAN variants trained on lower-resolution images ran on a single A100 GPU, with iteration times of roughly 40 minutes. In contrast, diffusion-based workflows focused primarily on inference, with individual image generations completing in seconds to minutes on a single A100 GPU. Using Perlmutter allowed the researchers to run these workflows at a consistent scale and resolution, making it possible to compare model behavior, performance, and scientific accuracy under realistic HPC conditions.
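To make the contrast concrete, an inference-only diffusion workflow of the kind described above might look like the sketch below, which uses the Hugging Face diffusers library on a single GPU. The model checkpoint, prompt, and generation settings are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative single-GPU diffusion inference sketch (not the paper's pipeline).
# The checkpoint, prompt, and settings below are assumptions for illustration.
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image diffusion model onto one GPU
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # hypothetical checkpoint choice
    torch_dtype=torch.float16,
).to("cuda")

# Generate one 512x512 image; on an A100 this typically takes seconds to minutes,
# depending on the number of denoising steps
prompt = "micro-CT cross-section of layered rock sediment"  # illustrative prompt
image = pipe(prompt, height=512, width=512, num_inference_steps=50).images[0]
image.save("generated_sample.png")
```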
Beyond model comparison, the study emphasizes transparency and reproducibility. By documenting datasets, training procedures, and evaluation methods in detail, the researchers aim to provide a reusable framework for assessing generative AI in scientific contexts. By applying a common set of evaluation metrics, including SSIM, LPIPS, FID, and CLIPScore, the team created a repeatable way to assess image quality and compare results across models.
This study highlights the potential and current limits of generative AI for scientific imaging. While some models can produce images that closely resemble real data, the results show that visual realism alone is not a reliable indicator that an image is scientifically accurate. The findings reinforce the importance of rigorous evaluation and domain expertise when using generative models for research problems where accuracy is critical. Looking ahead, the Berkeley Lab team plans to adapt generative AI models more closely to scientific imaging tasks, expand them to larger and more diverse datasets, and develop validation methods that can ensure reliability. Read the full paper here.
This article first appeared on HPCwire.

