AI Art Generation Handbook/How does AI generate art ?


The rise of artificial intelligence has led to a surge in the creation and enhancement of AI-generated images (especially in 2022, after the release of DALL-E 2), resulting in highly detailed and imaginative artwork. This development might prompt users to ask, "How does AI produce art?"

Humans usually draw inspiration from their surroundings, such as forests, urban landscapes, and their own reflections, and channel that inspiration into their art.

Similarly, AI art involves the creation of artwork with the help of generative AI. This technology identifies patterns within large datasets and uses this knowledge to generate new content. To create AI art, one needs an AI art generator, like Stable Diffusion, and a concept. The AI artist inputs a detailed prompt, which the tool then interprets to generate image options based on the description provided.

Artist Vera Molnár, a Hungarian artist, began experimenting with early programming languages to produce randomly generated artwork in 1968. Considered a pioneer of generative art, she has geometric creations included in major museum collections.

The core technology behind this capability is called a neural network. A neural network is a sophisticated mathematical system, or algorithm, designed to mimic the biological neural networks in the human brain; it functions by recognizing patterns in extensive datasets.

A neural network has a few main components, illustrated in the short sketch after this list: [3]

(i) Input Layer: This layer receives the initial input data, such as an image, text, or numerical values.

(ii) Hidden Layers: These are intermediate layers between the input and output layers, where most of the mathematical operations (e.g., matrix multiplications) occur.

(iii) Output Layer: This layer produces the final output of the neural network, which is the generated output (e.g., an image or text).

(iv) Connections (Weights and Biases): Neurons in adjacent layers are connected by weights, which determine the strength of the connections. Biases are additional parameters that shift the activation of a neuron.
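
To make these components concrete, the minimal NumPy sketch below wires an input layer, one hidden layer, and an output layer together with weights and biases. The layer sizes and activation function are purely illustrative and far smaller than anything used in a real image generator.

    import numpy as np

    # A minimal feedforward network: one input layer, one hidden layer, one output layer.
    # Sizes and activation are illustrative only; real image generators use far larger
    # networks with specialised layers (convolutions, attention, etc.).
    rng = np.random.default_rng(0)

    # Connections: weights set the strength between neurons in adjacent layers,
    # biases shift each neuron's activation.
    W1 = rng.normal(size=(4, 8))   # input layer (4 values) -> hidden layer (8 neurons)
    b1 = np.zeros(8)
    W2 = rng.normal(size=(8, 3))   # hidden layer (8 neurons) -> output layer (3 values)
    b2 = np.zeros(3)

    def forward(x):
        """Pass the input through the network and return the output layer's values."""
        hidden = np.maximum(0, x @ W1 + b1)   # hidden layer: matrix multiplication + ReLU
        output = hidden @ W2 + b2             # output layer: the generated values
        return output

    x = np.array([0.5, -1.0, 0.3, 2.0])       # input layer: e.g. numerical features
    print(forward(x))
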

When a user prompts an AI generator to depict a dog, the neural network draws on the vast amount of information it has been trained on to create a new image. This process involves complex layers of interconnected nodes that simulate the way a human brain processes information. The role of the AI artist is to refine these generated images, guiding the AI model to produce specific scenes, such as a dog wearing a French beret, a dog sitting in a bar, or a dog dancing in a kitchen. These neural networks are brimming with training data, but it is the creativity and direction from the user that truly shapes AI-generated art.


There are two major parts to AI art: training and inference. [4]

Training: Training is the process of teaching the neural network model to learn the patterns and relationships present in the training data.

Inference: Inference is the process of using the trained model to make predictions or generate outputs on new, previously unseen data.
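
The toy PyTorch sketch below separates the two phases. The model, data, and loss function are stand-ins chosen only to show where training ends and inference begins, not how an actual image generator is trained.

    import torch
    from torch import nn

    # Illustrative toy model; real generative models are vastly larger.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    # --- Training: show the model many (input, target) pairs and adjust its
    # weights and biases so its outputs move closer to the targets.
    for step in range(100):
        inputs = torch.randn(8, 16)          # stand-in for real training data
        targets = torch.randn(8, 16)
        loss = loss_fn(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()                      # compute how each weight should change
        optimizer.step()                     # update weights and biases

    # --- Inference: the trained weights are frozen and applied to new,
    # previously unseen data to produce an output.
    model.eval()
    with torch.no_grad():
        new_input = torch.randn(1, 16)
        prediction = model(new_input)
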

In the field of artificial intelligence, training data is the heart of generative AI. However, as the saying goes, quality matters over quantity: the general consensus is that high aesthetic quality and good image descriptions/captions matter more than the sheer number of samples the model processes. These models are trained on pairs of text captions and images, and one of the most popular approaches uses CLIP [5] (a short CLIP similarity sketch follows the table below). Here are the known image dataset training sizes:

Entity                        | Image Dataset Training Size
Midjourney                    | ~1B+ [a]
DALL-E 2                      | 250M [b]
Craiyon                       | 15M [c]
Google Imagen                 | 860M [d]
Stable Diffusion 1.5          | 400M [e]
Stable Diffusion XL (SDXL)    | 1.8M + 1.2M [f]
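
As a rough illustration of how caption-image pairs are scored, the sketch below uses an openly available CLIP checkpoint via the Hugging Face transformers library. The checkpoint name and the image URL are placeholder assumptions; this is not the exact model any of the generators above was trained with.

    from PIL import Image
    import requests
    import torch
    from transformers import CLIPModel, CLIPProcessor

    # Load a publicly available CLIP checkpoint (one common choice, used here
    # only as an example).
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Placeholder image URL standing in for one half of a caption-image pair.
    image = Image.open(requests.get("https://example.com/dog.jpg", stream=True).raw)
    captions = ["a dog wearing a beret", "a cat sleeping on a sofa"]

    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # A higher score means the caption matches the image better; this kind of
    # alignment between text and images in a shared embedding space is what
    # caption-image training exploits.
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(captions, probs[0].tolist())))
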

The picture shows an oversimplified flow chart of the AI text-to-art process, from the input words through each processing step until an image is produced.


(A) The prompt text, entered in the form of words, is "tokenized" into tokens by the text encoder.

(B) The tokens are then mapped to dense vector representations (embedding vectors), capturing semantic and contextual information about each token.

(C) The diffusion model generates the image by iteratively denoising a noisy latent (the reverse of the forward diffusion process used during training), conditioned on the embedding vectors.

(D) The dense vector representation (latent vector) compresses the essential visual features, content, and attributes of the image to be generated into a latent space representation.

(E) The image decoder synthesizes visual features such as textures, colors, and shapes based on the information encoded in the latent vector. After synthesizing the image, it can also upscale it while performing enhancement to increase its aesthetic quality.
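
The whole flow from prompt to picture can be run end to end with the open-source diffusers library. The sketch below is a minimal example assuming a Stable Diffusion 1.5 checkpoint and a CUDA GPU; the checkpoint id, step count, and guidance scale are chosen only for illustration.

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a pretrained Stable Diffusion checkpoint. This model id is one
    # commonly used example; any compatible checkpoint works.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")  # a GPU is assumed; use "cpu" (and float32) otherwise

    prompt = "a dog wearing a beret, sitting in a bar"

    # Internally the pipeline follows the steps above:
    # (A)/(B) the prompt is tokenized and encoded into embedding vectors,
    # (C)/(D) the diffusion model iteratively denoises a latent vector
    #         conditioned on those embeddings,
    # (E)     the image decoder (VAE) turns the final latent into pixels.
    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
    image.save("dog_in_beret.png")
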

References


[1] https://builtin.com/artificial-intelligence/how-does-AI-generated-art-work

[2] https://www.adobe.com/products/firefly/discover/what-is-ai-art.html

[3] https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

[4] https://blogs.nvidia.com/blog/difference-deep-learning-training-inference-ai/

[5] https://arxiv.org/abs/2204.06125

[6] https://www.youtube.com/watch?v=9YrYDqhJdPw

[a] https://www.theregister.com/2022/08/01/david_holz_midjourney/

[b] https://cdn.openai.com/papers/dall-e-2.pdf

[c] https://arxiv.org/pdf/2208.09333

[d] https://arxiv.org/pdf/2205.11487

[e] https://arxiv.org/pdf/2112.10752

[f] https://clarifai.com/stability-ai/stable-diffusion-2/models/stable-diffusion-xl