AI - image generation

Introduction

  • AI image generation poses a major risk to ascertaining authenticity and truth, as the internet is rapidly becoming inundated with AI-generated fake imagery and videos
  • understanding how these models are trained and how they generate images is a great way to learn about machine learning - in this case, finding patterns in random noise to produce a targeted image from a text prompt - a pretty amazing concept
  • AI image generators
  • AI can also be used to detect fake AI created imagery:
    • Google's SynthID invisible watermark technology
    • China has banned AI-generated images devoid of watermarks as of 2023
  • the immediate future of the rapidly evolving generative AI field:
      • Nvidia AI Workbench
        • allows anyone to access generative AI - users can develop using tools like JupyterLab and VS Code
        • simplifies the AI model development process by providing a single platform for managing data, models, and compute resources that supports collaboration across machines and environments.
        • integrates with services such as GitHub, NVIDIA NGC, and Hugging Face, as well as self-hosted registries and Git servers
        • allows users to clone a Python environment and project environment without having to install everything themselves
        • allows users to run Jupyter on cloud workstations of various powers
      • Nvidia Omniverse
        • utilises OpenUSD to connect USD tools together and create virtual simulated physical world environments with AI
        • rapid AI development via ChatUSD
      • Nvidia GH200 Grace Hopper superchips to accelerate large language model computing and reduce power usage
      • Nvidia L40S GPUs for workstations

The problem of non-consensual deep fake imagery

  • bad actors will use the technology in a negative manner
  • in Dec 2023, OctoML, a machine learning acceleration platform, severed ties with Civitai (which uses Stable Diffusion’s AI image generation technology to make custom models, and which it claims is a “pretty SFW” platform) after an investigation by 404 Media accused Civitai of profiting from nonconsensual AI explicit imagery
  • in Dec 2023, not surprisingly, there were reports of a growing number of AI apps that “remove” clothes from photos of people, contributing to fake humiliating scandals, etc

Preventing unauthorised use of your images for AI training

  • AI image generation models need to train on thousands of images; most have acquired such images by scraping the web without permission from copyright holders
  • 2023:
    • a software image editing process called Nightshade effectively “poisons” your image; if thousands of such poisoned images are used in training an AI model, the model becomes contaminated, corrupting its object classification and style classification systems

Systems to detect AI generated images

  • 2023:
    • various companies have developed proposals for invisible watermark systems to be used for all AI generated images to ensure fakes can be detected

AI image generation Text2Image models

DALL-E

  • an OpenAI image generation app
  • Oct 2023, version 3 released
    • you can tweak the prompt through conversations with ChatGPT, and it “can translate nuanced requests into extremely detailed and accurate images”
    • has more safeguards such as limiting its ability to generate violent, adult, or hateful content.
    • has mitigations to decline requests that ask for images of a public figure by name, or those that ask for images in the style of a living artist.
    • can be accessed via Bing Image Creator (you need to log into your Microsoft account)
    • Microsoft is embedding DALL-E 3 in its new “Copilot” AI app, which can utilise DALL-E 3 from within MS Paint, Windows 11, MS 365, Edge, etc.

Craiyon (DALL-E Mini)

  • a free, cloud-based, simple AI image generation app for newbies, but it has ads

Google's Imagen

Alibaba's Tongyi Wanxiang

MidJourney

  • a generative AI model that creates images from text prompts
  • need to sign up for Discord
  • basic plan is $US10/mth - no free version
  • by describing the image you want to see, Midjourney can generate results drawing on the enormous dataset of images it was trained on
  • some people found that simply adding H6D (to get Hasselblad studio camera images from the dataset) to the end of a prompt could yield higher-quality results
  • if you want a shallow depth-of-field result, add an aperture such as f/1.4 to the prompt, or use “editorial headshot” rather than “snapshot”
  • adding a desired lighting term such as “Rembrandt lighting” may also help it find the image style you want - but it doesn't create these from scratch; it is not a simulator but a generator working from pre-existing images. “Cinematic” often results in darker, low-key images, or use “low-key” and “high-key” and an angle of the shot such as “low-angle”, while “color grading” tends to produce complementary and bold colors

Stable Diffusion XL (SD-XL)

  • utilises reverse diffusion, first proposed in a paper published in 2015, but speeds up the computation by reducing images to latent spaces which are then manipulated
  • does not use Discord
  • Windows GUI apps for SD-XL
    • the main Windows GUI for this is Automatic 1111
      • Automatic 1111 is a browser-based Stable Diffusion web UI built on the Gradio library, for running on your local nVidia GPU under Windows 10, and needs Python 3.10.6 (as of Aug 2023, DON’T use Python 3.11 or newer)
      • how to install SD-XL on Windows - suggests you need to uninstall and re-install Python if it is not version 3.10.6! although v3.10.11 seems to work fine with Automatic 1111
      • Automatic 1111 uses its own virtual environment - the Python binary it uses should be stable-diffusion-webui\venv\Scripts\Python.exe - so it should not matter what version is on your machine?
    • ComfyUI
      • an alternative GUI
  • open source
    • the Stable Diffusion XL model is available for download at HuggingFace
  • a model that can be used to generate and modify images based on text prompts
  • it is the culmination of a collective effort to compress the visual information of humanity into a single 6GB file
  • consists of an ensemble of experts pipeline for latent diffusion:
    • The language model (the module that understands your prompts) is a combination of the largest OpenClip model (ViT-G/14) and OpenAI’s proprietary CLIP ViT-L
    • the base model is used to generate (noisy) 128×128 pixel latents
    • these images are further processed with a refinement model specialized for the final denoising steps to produce 1024×1024 pixel images (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) - see the diffusers sketch after this list
  • can generate realistic faces and legible text within images
  • offers several artistic styles for image generation:
    • No style, Enhance, Anime, Photographic, Digital Art, Comic book, Fantasy art, Analog film, Neon punk, Isometric, Low poly, Origami, Line Art, Craft clay, Cinematic, 3D model, and Pixel Art
  • has the ability to generate image variations using:
    • image-to-image prompting
    • inpainting (re-imagining of the selected parts of an image)
    • outpainting (creating new parts that lie outside the image borders)
  • v1.0 has output resolution of 1024×1024 pixels which needs:
    • at least 8GB VRAM on an nVidia CUDA-compatible GPU (6GB will work but will take 1hr to generate instead of 30secs with 8GB and 20secs with 12GB)
    • at least 12GB VRAM on the GPU to train a 1024×1024 LoRA model (but may take 90mins) while 24GB VRAM will do this in seconds
    • you should also have 64GB RAM although 32GB will suffice
    • if you don't have the above, use Google Colabs and run it online or you can use other online tools such as https://dreamstudio.ai/ (you will need to register)
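
As a rough illustration of the base + refiner "ensemble of experts" pipeline described above, here is a minimal Python sketch using Hugging Face's diffusers library. It assumes diffusers and torch are installed and a CUDA GPU with sufficient VRAM is available; the prompt, step counts and the 80/20 denoising split are illustrative values only, not settings from this page.

```python
# minimal SD-XL base + refiner sketch using Hugging Face diffusers (illustrative settings)
import torch
from diffusers import DiffusionPipeline

# load the base model (generates the noisy latents)
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

# load the refiner model (specialised for the final denoising steps), sharing the text encoder and VAE
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2, vae=base.vae,
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "a photograph of an astronaut riding a horse, cinematic lighting"

# run the base model for roughly the first 80% of the denoising steps, outputting latents
latents = base(prompt=prompt, num_inference_steps=40,
               denoising_end=0.8, output_type="latent").images

# hand the latents to the refiner for the remaining steps
image = refiner(prompt=prompt, num_inference_steps=40,
                denoising_start=0.8, image=latents).images[0]
image.save("sdxl_output.png")
```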

Running Stable Diffusion via Automatic 1111 on your computer

  • assumes you have already installed Automatic 1111 on your computer
  • create a Windows shortcut to the stable-diffusion-webui\webui-user.bat file and run it - if you want to script generations instead of using the browser, see the API example below
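
Automatic 1111 also exposes a REST API when launched with the --api flag (add it to the COMMANDLINE_ARGS line in webui-user.bat). A minimal sketch for scripting txt2img, assuming the web UI is running locally on its default port 7860 and the Python requests library is installed; the prompt and settings are illustrative only.

```python
# minimal sketch: call Automatic 1111's txt2img API (assumes the web UI was launched with --api)
import base64
import requests

payload = {
    "prompt": "photograph of a red woollen scarf on a wooden table, soft lighting",
    "negative_prompt": "blurry, low quality",
    "steps": 25,
    "width": 1024,
    "height": 1024,
}

# the web UI listens on http://127.0.0.1:7860 by default
resp = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
resp.raise_for_status()

# the API returns base64-encoded PNG images
for i, img_b64 in enumerate(resp.json()["images"]):
    with open(f"output_{i}.png", "wb") as f:
        f.write(base64.b64decode(img_b64))
```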

txt2img

  • prompts:
    • words near the front of your prompt are weighted more heavily than those at the back of your prompt
    • control weights applied to prompt tokens as follows:
      • single brackets 1.05; double brackets 1.1025; 4 brackets each side 1.216; 6 brackets each side 1.34; 8 brackets each side 1.477;
      • or shortcut (keyword:weight) eg. (tent:1.477)
      • square brackets results in de-emphasizing - ie dividing by the above factor instead of multiplying
    • try content type > description > style > composition (a prompt-assembly sketch follows this list)
      • Content type: eg. photograph, drawing, sketch, 3D render
      • description: define the subject, subject attributes, environment/scene. The more descriptive you are with the use of adjectives, the better the output.
      • style: pencil drawing, concept art, oil painting, etching, realistic, cinematic, Watercolor, studio portrait, hyperrealistic, pop art, marble sculpture
        • lighting:
          • accent lighting, ambient lighting, backlight, blacklight, blinding light, candlelight, concert lighting, crepuscular rays, direct sunlight, dusk, Edison bulb, electric arc, fire, fluorescent, glowing, glowing radioactively, glow-stick, lava glow, moonlight, natural lighting, neon lamp, nightclub lighting, nuclear waste glow, quantum dot display, spotlight, strobe, sunlight, ultraviolet, dramatic lighting, dark lighting, soft lighting, gloomy
        • detail:
          • highly detailed, grainy, realistic, unreal engine, octane render, bokeh, vray, houdini render, quixel megascans, depth of field (or dof), arnold render, 8k uhd, raytracing, cgi, lumen reflections, cgsociety, ultra realistic, volumetric fog, overglaze, analog photo, polaroid, 100mm, film photography, dslr, cinema4d, studio quality
        • technique:
          • Digital art, digital painting, color page, featured on pixiv (for anime/manga), trending on artstation, precise line-art, tarot card, character design, concept art, symmetry, golden ratio, evocative, award winning, shiny, smooth, surreal, divine, celestial, elegant, oil painting, soft, fascinating, fine art
          • in the style of … Claude Monet, Vincent van Gogh, Pablo Picasso, Johannes Vermeer, John Singer Sargent, Alphonse Mucha, etc
      • composition:
        • aspect ratio
          • this can make a massive difference on the output depending upon how the model was trained and what your prompts are
        • camera view
          • ultra wide-angle, wide-angle, aerial view, massive scale, street level view, landscape, panoramic, bokeh, fisheye, dutch angle, low angle, extreme long-shot, long shot, close-up, extreme close-up, highly detailed, depth of field (or dof)
        • camera resolution
          • 4k, 8k uhd, ultra realistic, studio quality, octane render
        • style and composition
          • Surrealism, trending on artstation, matte, elegant, illustration, digital paint, epic composition, beautiful, the most beautiful image ever seen
        • colours
          • Triadic colour scheme, washed colour
  • sampling method:
    • some say it doesn't make a lot of difference
    • if you want soft and artsy, you could use DPM_adaptive or DDIM; if you want variety go for DPM_fast; and if you’re looking for photorealism try DPM2 or Euler_a.
    • for a detailed explanation see https://stable-diffusion-art.com/samplers/
  • steps:
    • how many steps to spend generating (diffusing) your image - more steps means higher image quality but longer generation time
  • image ratio:
    • currently this has a big impact if you are wanting a person in the image:
      • if you have it in portrait aspect ratio, there is a high likelihood you will get two heads or two bodies attached on top of each other - can minimise this by including standing, long dress, or legs in your prompt
      • if you have it in landscape aspect ratio, there is a high likelihood you will get two heads side by side, or a distorted body, or two people
      • thus the easiest option is a 1:1 ratio for portraits
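
To make the content type > description > style > composition ordering and the (keyword:weight) syntax above concrete, here is a small illustrative Python sketch that assembles a prompt in that order; the keywords and weight values are examples only, not recommendations.

```python
# illustrative prompt builder following content type > description > style > composition
# (all keywords and weights below are example values only)
parts = {
    "content type": "photograph",
    "description": "an elderly fisherman mending a net on a wooden jetty at dawn",
    "style": "cinematic, soft lighting, film photography, analog photo",
    "composition": "wide-angle, low angle, golden ratio, depth of field",
}

# emphasise a single keyword with the (keyword:weight) shortcut, eg. (tent:1.477)
emphasised = "(weathered hands:1.3)"

prompt = ", ".join(parts.values()) + ", " + emphasised
print(prompt)
```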

img2img

  • use an existing image as a base input, or a sketch you draw in the app under the Sketch tab (a minimal scripted img2img/inpainting example follows this list)
  • check restore faces
  • set resize mode to resize or resize and crop as needed
  • set sampling mode eg. DPM++2M Karras or Euler a
  • set resolution eg 1024,1024 if 1:1, 1128,848 if landscape, 848,1128 if portrait
  • set denoising strength - the amount of randomness based upon the random seed - usually start with around 0.5-0.7; the higher, the more “creative” the outcome
  • set a prompt - can be much simpler than the txt2img prompt given you are starting with an image eg. change hair color
  • to avoid wholesale changes to the image, you can click on inpaint tab and just mask in the areas you want changed (or not changed - choose mask mode to do this)
    • set mask blur level as desired - usually between 4-10
    • set masked content: to just change the colour of the hair, for instance, use original; if you want the hair very different and re-coloured then choose latent noise
    • set inpaint area - choose whole picture if you want the AI to examine the whole image as inspiration for the changes to the masked area, else choose Only Masked and then choose the padding size (how many pixels outside the mask to use for inspiration)
  • to add something such as a scarf, use inpaint sketch
    • choose a color, mask the area where you would like the scarf
    • in the prompt use RED WOOLLEN SCARF
    • increase denoising strength to 0.8 and cfg to 5.5
  • inpaint upload allows using a mask created in Photoshop, etc
  • sketch allows you to create a color filled sketch of your ideas and then reference the areas in your prompt to generate an image
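
The inpainting workflow above can also be scripted outside the web UI. A minimal sketch using the diffusers inpainting pipeline is shown below; it assumes diffusers, torch and Pillow are installed, and that portrait.png / scarf_mask.png are your own base image and mask - the file names, prompt, strength and CFG values are illustrative only.

```python
# minimal inpainting sketch with diffusers (file names and settings are examples only)
import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image

pipe = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # or a dedicated inpainting checkpoint
    torch_dtype=torch.float16, variant="fp16"
).to("cuda")

init_image = Image.open("portrait.png").convert("RGB")    # the base image
mask_image = Image.open("scarf_mask.png").convert("RGB")  # white = area to repaint

image = pipe(
    prompt="red woollen scarf",
    image=init_image,
    mask_image=mask_image,
    strength=0.8,        # denoising strength - higher = more "creative"
    guidance_scale=5.5,  # roughly the CFG value mentioned above
).images[0]
image.save("portrait_with_scarf.png")
```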

base models "checkpoints"

  • Custom checkpoint models are made with either (1) additional training or (2) Dreambooth
    • They both start with a base model like Stable Diffusion v1.5 or XL.
    • Additional training is achieved by training a base model with an additional dataset of images you are interested in; however, this will require a lot of VRAM (at least 12GB)
  • NB. some checkpoints were trained with a variational autoencoder (VAE) which improves image quality, in which case you will need to use that VAE as well to get good imagery (see the sketch below)
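
As an illustration, a checkpoint's recommended VAE can be loaded explicitly and passed to the pipeline. A minimal diffusers sketch follows; the VAE and checkpoint repo names are just common examples and should be replaced with whatever the checkpoint author recommends.

```python
# minimal sketch: load a custom checkpoint together with its recommended VAE
import torch
from diffusers import StableDiffusionPipeline, AutoencoderKL

# example VAE repo - substitute whatever VAE the checkpoint author recommends
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example base/custom checkpoint
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("studio portrait of a woman, soft lighting").images[0]
image.save("with_custom_vae.png")
```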

LCM LoRA for faster image generation

LoRA pre-trained models

common image output issues and how to fix them

train LoRA with your own images

  • NB. you will need at least 12-24GB VRAM and an nVidia CUDA-capable GPU
  • using Kohya's GUI
    • requires Visual Studio, Python, Git to be installed
    • create a folder under c:\users\…
    • then run setup.bat and it will create a venv - Python 3.10.9 is recommended
      • at option, choose 1 install kohya_ss gui, then next option choose 2 Torch 2
      • then it will ask what machine - hit Enter to choose This machine
      • then it will ask type of machine - hit Enter to choose No Distributed Training
      • then it will ask if CPU only - hit No if you have a GPU
      • do you want to optimise script with torch dynamo - hit No
      • do you want to use DeepSpeed - hit No
      • What GPU for training - type all
      • What fp to use - if you have an nVidia GPU of the 3000 series or newer then use bf16
    • to launch the gui, click on gui.bat file and a URL will be displayed for you to copy and paste into your browser
    • then find at least 10 high quality images to train with and save them to a training folder
      • these should be high resolution, with different backgrounds and different expressions which represent the person if it is a portrait
      • avoid out of focus images, avoid multiple characters in the image, don't crop the images as you get better results uncropped
      • if using Google - use Google Advanced Image Search and then you can choose image size eg. larger than 4mp
    • find a large number (~1000) of 1024×1024 images of the same class as your subject (eg. woman) and save them to a regularisation folder which will be used as comparison images
    • in the GUI you need to:
      • set parameters - these can be loaded in with a json config file if you have one
        • save trained model as safetensors
      • select the pre-trained model in SDXL - ie. the base model
      • tick SDXL model
      • under Lora Tools go to Deprecated
        • set instance prompt to a named prompt - this works best if SD-XL has already been trained on that prompt and knows it, eg. a celebrity; if the person is not a celebrity, find a celebrity who looks like that person using StarByFace
        • set class eg. man
        • set training image folder
        • set training image repeats to 20
        • set a destination training directory eg subfolder PICTURES/SubjectName
          • 4 subfolders will then be created:
            • img and a subfolder named after image repeats and instance prompt
            • log
            • model
            • reg
        • then click on Copy info to Folders Tab button
      • under Lora Training
        • click on Folders and you should now see the folders as outlined above
        • set the Model output name as follows: SubjectDesc-InstancePrompt so you don't forget what you need to put in your prompt
      • under Utilities tab Blip captions you then create image caption text files:
        • choose your training image folder
        • set caption file extension to .txt (default)
        • prefix should be your Instance prompt
        • leave other settings as defaults then click on caption images which will generate txt files describing the image including your instance prompt
        • you can then further edit these txt files to add more details
        • copy these txt files to the subfolder created under img
      • under LoRA parameters there are a lot of settings
        • train batch size - start with 5 for smoother outcomes for a celeb but perhaps with 1 for a person
        • epoch could be 10
        • save every N epochs could be 1
        • caption extension is the extension you used in the blip captioning (ie. .txt)
        • mixed precision and save precision will be set at what your install settings were eg. bf16
        • leave number cpu threads per core and seed as the defaults
        • ensure cache latents and cache latents to disk are both checked
        • learning rate 0.0001-0.0003 seems to work best
        • LR scheduler = constant
        • LR warmup = 0
        • optimizer = Adafactor
        • if you use Adafactor, which uses less VRAM, you need optimizer arguments = scale_parameter=False relative_step=False warmup_init=False
        • max resolution 1024,1024 (if training a style 768,768 will be fine and will reduce VRAM usage)
        • check Enable Buckets (this is why there is no need to crop images if using high resolution 4mp or higher ones)
        • min bucket resolution 256 max 2048
        • text encoder and unet learning rates same as what you used above
        • SDXL specific - check No half VAE, but don't use Cache text encoder outputs which, although it should reduce VRAM usage, is currently broken
        • Network Rank (dimension) - how much detail the model retains; the higher, the more detail saved. Setting it to 256 with Alpha set to 1 will give a larger model (could be 1.6GB) and use much more VRAM (perhaps 12-20GB), while a 160MB model with Rank = 32 and Alpha = 16 may be adequate for your needs, albeit with much less detail retained
        • Network Alpha
        • remainder on that tab can be left at 0 as per defaults
        • need to check Gradient checkpointing
        • CrossAttention = xformers
        • check Don't upscale bucket resolution
    • Now you can click Start Training button
      • with 12 training 1024×1024 images, 20 repeats, 10 epochs, 4800 steps it may take 2hrs on a 24GB VRAM nVidia RTX 3090
    • Now you need to evaluate each of the epoch generated models - in this case there are 10 saved models
      • move the saved models to the Lora folder of SDXL
      • in txt2img, use a prompt with your instance prompt and the class description you used
      • set resolution and set a seed value (not -1) so you can compare the Lora model outputs
      • load the Lora model by opening the Lora tab (click the button to the right of the waste bin button) and then clicking on the LoRA, which will be appended to your prompt - you can add as many as you like, but then you need to cut them and paste them into the X values section (append each except the last with a comma) after selecting Script = X/Y/Z plot and X type = Prompt S/R - this creates a batch process that iterates through each Lora model and outputs an image for each (or script the comparison yourself as in the sketch after this list)
      • the model you choose will be a balance between accuracy and flexibility in applying different styles - you can always improve image quality using HiRes fix hence flexibility may be preferred over accuracy
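
If you prefer to script the epoch-by-epoch comparison rather than use the X/Y/Z plot, a minimal diffusers sketch is shown below. It assumes the safetensors LoRA files saved by Kohya sit in a local "loras" folder and uses a fixed seed so the outputs are comparable; the folder, file pattern, prompt and settings are illustrative only.

```python
# minimal sketch: compare LoRA checkpoints from successive epochs using a fixed seed
import os
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",   # the base model the LoRA was trained on
    torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# use your instance prompt + class, as described above (placeholder shown here)
prompt = "photograph of <instance prompt> man, studio portrait"

for lora_file in sorted(os.listdir("loras")):     # one safetensors file per saved epoch
    if not lora_file.endswith(".safetensors"):
        continue
    pipe.load_lora_weights("loras", weight_name=lora_file)   # attach this epoch's LoRA
    generator = torch.Generator("cuda").manual_seed(1234)    # fixed seed for a fair comparison
    image = pipe(prompt, num_inference_steps=30, generator=generator).images[0]
    image.save(lora_file.replace(".safetensors", ".png"))
    pipe.unload_lora_weights()                    # remove it before loading the next epoch
```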

Use AI agents to improve the image generation

Leonardo.ai

  • a cloud-based app based on Stable Diffusion that does not require installation or a powerful computer and has an easier-to-use GUI than MidJourney
  • foundational model of 5 billion image-text pairs (LAION-5B dataset)
  • a free account allows 150 images per day but with limited tokens per month
  • need to join up with Discord
  • you can upload your own images to train a model