A Style-Based Generator Architecture for Generative Adversarial Networks

Metadata

Reading status: read complete
Year: 2018
Compute regime: Generative media compute (generative_media_compute)
PDF: 2018-stylegan_2018.pdf
Extracted text: 2018-stylegan_2018.txt
PDF URL: https://arxiv.org/pdf/1812.04948.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly reports the training device: about one week on an NVIDIA DGX-1 with 8 Tesla V100 GPUs. This places StyleGAN in the Volta tensor-core, multi-GPU workstation/server era rather than a datacenter-scale training regime. The codebase builds on the official TensorFlow Progressive GAN implementation, inheriting the discriminator, resolution-dependent minibatch sizes, Adam hyperparameters, and exponential moving average of the generator unless otherwise stated.

The main image target is 1024x1024 face synthesis on FFHQ, a new 70,000-image dataset introduced in the paper. The style-based generator has 26.2M trainable parameters, compared with 23.1M for the traditional baseline generator. All FIDs in the paper are computed from 50,000 generated images without truncation. The improved FFHQ setup uses R1 regularization and extends training from 12M to 25M images because FID keeps decreasing longer than with WGAN-GP. The LSUN models are trained with the same setup: 70M images for Bedrooms and Cats, and 46M images for Cars.
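
As a rough sanity check on scale, the figures above can be turned into a back-of-envelope throughput estimate. This is only a sketch: it assumes that "about one week" means exactly seven days and that it applies to the 25M-image FFHQ run, neither of which this card pins down. All variable names are illustrative.

```python
# Back-of-envelope figures derived from numbers quoted in this card.
STYLE_PARAMS_M = 26.2      # style-based generator, millions of parameters
BASELINE_PARAMS_M = 23.1   # traditional baseline generator
FFHQ_IMAGES = 25_000_000   # training length for the improved FFHQ setup
GPUS = 8                   # DGX-1 with 8 Tesla V100s

# ASSUMPTION: "about one week" is taken as exactly 7 days and is applied
# to the 25M-image FFHQ run; treat the result as an order-of-magnitude guide.
ASSUMED_SECONDS = 7 * 24 * 3600

param_overhead = STYLE_PARAMS_M / BASELINE_PARAMS_M - 1.0
images_per_second = FFHQ_IMAGES / ASSUMED_SECONDS
images_per_second_per_gpu = images_per_second / GPUS

print(f"parameter overhead: {param_overhead:.1%}")
print(f"aggregate throughput: {images_per_second:.1f} images/s")
print(f"per-GPU throughput: {images_per_second_per_gpu:.1f} images/s")
```

Under these assumptions the generator grows by a bit over a tenth in parameter count, and the whole machine processes a few tens of training images per second averaged over all resolutions.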

Bottleneck

The bottleneck is high-resolution adversarial convolutional training under tight multi-GPU memory budgets and fragile training stability. A 1024x1024 generator/discriminator pair must synthesize and judge images across scales, from pose and layout down to hair, freckles, and texture. Progressive GANs made this feasible by growing resolution over training, but convergence still takes many millions of images and remains sensitive to loss, regularization, learning rate, and dataset artifacts.

The second bottleneck is representation control. Traditional GAN generators feed the latent code through the input layer and let the network entangle high-level structure, color, texture, and stochastic detail. That is not just an interpretability problem: it means the model must use the same latent pathway for global pose and local random hair placement. At high resolution, forcing one latent vector to carry all deterministic and stochastic variation makes synthesis harder to control and harder to debug.

The paper's compute setting matters because the solution must remain trainable on a single 8-V100 DGX-1. It cannot assume a new massive discriminator, extra supervised labels, or expensive search at inference. The method therefore changes the generator's internal parameterization while leaving the discriminator and loss function unchanged.

Method Adaptation

StyleGAN keeps the Progressive GAN training regime but redesigns the generator. Instead of feeding z only at the input, an 8-layer fully connected mapping network maps z into an intermediate latent space W. Learned affine transforms convert W into per-layer styles, and those styles control adaptive instance normalization after each convolution. The synthesis network starts from a learned 4x4 constant rather than a traditional latent input.
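
The data flow described above (z to W, W to a per-layer style, style applied through adaptive instance normalization) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the official implementation: the helper names (`dense`, `random_dense`, `mapping_network`, `adain`), the random weights, and the tiny dimensions are all hypothetical, and training is omitted entirely.

```python
import math
import random

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def dense(vec, weights, bias):
    """y = W x + b for plain Python lists (one weight row per output unit)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def random_dense(rng, n_in, n_out):
    """Hypothetical initializer: random weights, zero bias."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mapping_network(z, layers):
    """f: Z -> W, a stack of fully connected layers (8 in the paper)."""
    w = z
    for weights, bias in layers:
        w = [leaky_relu(v) for v in dense(w, weights, bias)]
    return w

def adain(features, style_scale, style_bias, eps=1e-8):
    """Normalize each channel to zero mean and unit variance, then apply the style."""
    out = []
    for channel, ys, yb in zip(features, style_scale, style_bias):
        flat = [v for row in channel for v in row]
        mean = sum(flat) / len(flat)
        var = sum((v - mean) ** 2 for v in flat) / len(flat)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([[ys * (v - mean) * inv_std + yb for v in row] for row in channel])
    return out

rng = random.Random(0)
LATENT_DIM, CHANNELS = 16, 4   # toy sizes (assumption; the real model is far larger)

mapping_layers = [random_dense(rng, LATENT_DIM, LATENT_DIM) for _ in range(8)]
affine_w, affine_b = random_dense(rng, LATENT_DIM, 2 * CHANNELS)

z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
w = mapping_network(z, mapping_layers)

# Learned affine transform: w -> (scale, bias) per channel for one layer.
style = dense(w, affine_w, affine_b)
style_scale, style_bias = style[:CHANNELS], style[CHANNELS:]

# Stand-in for a convolution output on top of the learned 4x4 constant.
features = [[[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]
            for _ in range(CHANNELS)]
styled = adain(features, style_scale, style_bias)
```

After `adain`, each channel's mean equals its style bias and its standard deviation equals the magnitude of its style scale, which is why the style fully overrides the incoming per-channel statistics at every layer.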

The design is compute-compatible with the DGX-1 setting. It adds a moderate number of parameters, preserves dense convolutional kernels, and does not require changing the discriminator. The style controls are localized by resolution: coarse layers affect pose, face shape, and eyeglasses; middle layers affect smaller facial features and hair style; fine layers affect color scheme and microstructure. This makes the generator's multi-scale computation match the progressive high-resolution training schedule.
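
The coarse/middle/fine split can be made concrete by indexing the style inputs of the 1024x1024 synthesis network. The sketch below assumes two style inputs per resolution from 4x4 to 1024x1024 and uses the paper's resolution ranges for the three groups (4-8, 16-32, 64-1024), which this card does not spell out; the function and variable names are illustrative.

```python
# 1024x1024 synthesis network: resolutions 4, 8, ..., 1024.
RESOLUTIONS = [4 * 2 ** i for i in range(9)]

# ASSUMPTION: two style inputs per resolution, giving 18 style layers in total.
STYLE_LAYERS = [res for res in RESOLUTIONS for _ in range(2)]

def style_group(resolution):
    """Map a layer resolution to the coarse / middle / fine grouping."""
    if resolution <= 8:
        return "coarse"   # pose, face shape, eyeglasses
    if resolution <= 32:
        return "middle"   # smaller facial features, hair style
    return "fine"         # color scheme, microstructure

groups = {}
for layer_index, res in enumerate(STYLE_LAYERS):
    groups.setdefault(style_group(res), []).append(layer_index)

for name, indices in groups.items():
    print(f"{name:6s} layers {indices[0]:2d}-{indices[-1]:2d}")
```

Under this indexing, more than half of the style inputs fall in the fine group, even though the most visible semantic changes come from the four coarse ones.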

The paper also separates stochastic detail from the latent code by injecting per-layer Gaussian noise through learned per-channel scaling. That gives the network a cheap source of local randomness for hair, freckles, fabric, and background details without forcing these effects through W. Style mixing regularization then generates a fraction of the training images from two latent codes, switching from one to the other at a random layer, which discourages adjacent styles from becoming overly correlated. Because a deeper mapping network tends to become unstable at high learning rates, the mapping network's learning rate is reduced by two orders of magnitude.
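
The three mechanisms in this paragraph (noise injection, style mixing, reduced mapping learning rate) are simple enough to sketch. The code below is a minimal illustration, not the released code: `mix_styles` and `add_noise` are hypothetical names, `BASE_LR` is a placeholder value rather than the paper's schedule, the count of 18 style layers is assumed for the 1024x1024 network, and the noise scales are hard-coded here where the real model learns them.

```python
import random

NUM_STYLE_LAYERS = 18          # ASSUMPTION: two per resolution, 4x4 to 1024x1024
MIXING_PROBABILITY = 0.9       # the 90% setting discussed in this card
BASE_LR = 1e-3                 # placeholder (assumption), not the paper's value
MAPPING_LR = 0.01 * BASE_LR    # two orders of magnitude lower for the mapping network

def mix_styles(w1, w2, rng, num_layers=NUM_STYLE_LAYERS, prob=MIXING_PROBABILITY):
    """Return one latent per layer, switching from w1 to w2 at a random crossover."""
    if rng.random() >= prob:
        return [w1] * num_layers                 # no mixing for this sample
    crossover = rng.randint(1, num_layers - 1)   # randint bounds are inclusive
    return [w1] * crossover + [w2] * (num_layers - crossover)

def add_noise(features, channel_scales, rng):
    """Add one Gaussian noise image, shared by all channels, with per-channel scaling."""
    height, width = len(features[0]), len(features[0][0])
    noise = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
    return [[[v + s * noise[y][x] for x, v in enumerate(row)]
             for y, row in enumerate(channel)]
            for channel, s in zip(features, channel_scales)]

rng = random.Random(0)
per_layer = mix_styles("w1", "w2", rng)          # strings stand in for latent vectors

features = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]
noisy = add_noise(features, [0.0, 0.5, 1.0], rng)
```

A channel whose scale is zero ignores the noise entirely, so the network can decide per channel and per layer how much stochastic variation to admit.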

Evidence

The main FID table shows the generator changes paying off. On FFHQ, the baseline Progressive GAN configuration starts at FID 8.04, and the tuned baseline reaches 5.25. Adding the mapping network and styles reaches 4.85; removing the traditional input leaves FID essentially unchanged at 4.88; adding noise inputs reaches 4.42; and adding mixing regularization reaches 4.40. On CelebA-HQ, the baseline starts at 7.79 and the final configuration reports 5.17.
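
To see how much each step contributes, the FFHQ numbers quoted above can be laid out as step-by-step deltas. This short script only rearranges figures already in this card; the configuration labels are paraphrases, and the two summary ratios are illustrative derived quantities, not values reported by the paper.

```python
# FFHQ FIDs in ablation order, as quoted in this card (lower is better).
FFHQ_FID = [
    ("baseline Progressive GAN", 8.04),
    ("tuned baseline", 5.25),
    ("+ mapping network and styles", 4.85),
    ("- traditional input", 4.88),
    ("+ noise inputs", 4.42),
    ("+ mixing regularization", 4.40),
]

previous = None
for name, fid in FFHQ_FID:
    delta = "" if previous is None else f" ({fid - previous:+.2f})"
    print(f"{name:30s} {fid:5.2f}{delta}")
    previous = fid

# Relative FID reduction of the final configuration.
total_gain = 1.0 - FFHQ_FID[-1][1] / FFHQ_FID[0][1]          # vs. original baseline
architecture_gain = 1.0 - FFHQ_FID[-1][1] / FFHQ_FID[1][1]   # vs. tuned baseline
print(f"vs. baseline: {total_gain:.1%}, vs. tuned baseline: {architecture_gain:.1%}")
```

The split is worth keeping in mind: the largest single drop comes from tuning the baseline, while the generator redesign accounts for the remaining improvement over that tuned starting point.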

The mixing table shows why regularization matters for compositional use. Without mixing regularization, the FFHQ model has FID 4.42 when tested with one latent but degrades to 8.22 with two test latents and 17.41 with four. With 90% mixing regularization, the one-latent FID is 4.40, and the two-, three-, and four-latent FIDs are 5.11, 6.88, and 9.03. That is evidence that the method is not only improving static image quality; it is making the per-layer computation robust to style recombination.
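
One way to read the mixing table is as a degradation factor relative to the one-latent FID. The sketch below uses only the values quoted in this paragraph; the dictionary layout and the `degradation` ratio are illustrative choices, not a metric defined by the paper.

```python
# FFHQ FID by number of latents mixed at test time, as quoted in this card.
MIXING_FID = {
    "0% mixing": {1: 4.42, 2: 8.22, 4: 17.41},
    "90% mixing": {1: 4.40, 2: 5.11, 3: 6.88, 4: 9.03},
}

# Ratio of each FID to the one-latent FID of the same model.
degradation = {
    setting: {n: fid / by_latents[1] for n, fid in by_latents.items()}
    for setting, by_latents in MIXING_FID.items()
}

for setting, ratios in degradation.items():
    cells = ", ".join(f"{n} latents: x{r:.2f}" for n, r in ratios.items())
    print(f"{setting}: {cells}")
```

With four test latents the unregularized model degrades by roughly a factor of four, while the model trained with 90% mixing degrades by roughly a factor of two, at almost no cost in one-latent FID.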

The training-convergence appendix ties quality to long high-resolution runs. With R1 regularization, FID continues to decrease as training proceeds, motivating the move from 12M to 25M images on FFHQ. The paper marks 8.4M images as the point where training has reached full 1024x1024 resolution, after which FID still improves. Other dataset evidence includes LSUN Bedroom FID 2.65 at 256x256, Cars FID 3.27 at 512x384, and Cats FID 8.53 at 256x256.

Historical Effect

StyleGAN turned high-resolution GANs from black-box samplers into controllable synthesis systems. Historically, the compute achievement was not a larger model but a better internal allocation of the existing DGX-1-scale convolutional computation: global attributes, mid-level features, fine texture, and stochastic detail became separable handles. That made latent-space editing, style mixing, truncation in W, and later inversion workflows central to image-generation practice.

It also set a strong quality bar for 1024x1024 face generation before diffusion models became dominant. The FFHQ dataset, released code, pretrained networks, and automated metrics for perceptual path length and separability made StyleGAN a reference platform for studying generative representations.

Limits

The method does not remove the adversarial-training burden. It still needs about a week on an 8-V100 DGX-1 for the main setup, and the LSUN models use much longer image-count schedules (70M images for Bedrooms and Cats, against 25M for FFHQ). Training remains sensitive enough that the paper changes loss/regularization for FFHQ, tunes high-resolution learning rates, and lowers the mapping-network learning rate for stability.

Better FID also does not mean complete disentanglement. The convergence appendix notes that FID improvements can come with rising path lengths, suggesting a more entangled representation as training continues. Dataset quality remains a limit: the authors attribute some LSUN Bedroom issues to inherited compression artifacts, and Cats remains difficult because of variation in pose, zoom, and background. Finally, the approach is GAN-specific; it improves controllability and quality but does not provide diffusion-style likelihood training or broad text conditioning.