Digging into the history of Generative AI

Aayushi Mittal
5 min read · Jan 28, 2024

Over the past few years, advances in AI have produced deep-learning models that can generate new images from text inputs, making it easier for people to visualize their ideas. Big tech companies such as OpenAI, Google, and Facebook have created text-to-image tools that are not yet available to everyone, but similar models have also been developed by independent open-source developers and by smaller companies like Midjourney.

In 2015, AI research made a significant advance with automated image captioning, in which machine learning algorithms could label objects in images and then describe them in natural language. This sparked the curiosity of a group of researchers, who wondered whether the process could be reversed. Instead of image-to-text, they decided to try text-to-image generation, which was a more challenging task. They aimed to create entirely new, unseen scenes rather than simply retrieving existing images the way a Google search does. The goal was to produce images that had never existed before.

Over the past year, independent open-source developers have built text-to-image generators from publicly available pre-trained models. For an image generator to handle many different requests, it needs a large and diverse collection of images paired with descriptions. Engineers build these datasets by collecting hundreds of millions of images from the internet along with their accompanying text, such as the alt text that website owners add to images for accessibility and search engine purposes.

It is tempting to think that a model searches through its training data for similar pictures and copies some of their pixels, but that is not what happens. The generated image does not come from the training data. It comes from the “latent space” inside the deep learning model. To understand how that works, we need to look at how the model learns. If you were given images and asked to match them with descriptions, it would be easy for you. To a machine, though, an image is just a grid of numbers for red, green, and blue. At first the machine can only guess, but over many training examples deep learning lets it find features that reliably match images with descriptions.
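To make the idea of matching images with descriptions concrete, here is a minimal sketch in Python. It is not how any real model is implemented. Real systems learn their image and text encoders from data, whereas this toy uses a hand-written stand-in (the average color of the image) and hypothetical, hand-picked caption vectors. The names `mean_color`, `caption_vectors`, and `best_caption` are made up for illustration.

```python
import math

# To a machine, an image is just numbers: here, four pixels as (R, G, B) values.
mostly_red_image = [(255, 0, 0), (250, 5, 5), (255, 10, 0), (245, 0, 10)]
mostly_blue_image = [(0, 0, 255), (5, 10, 250), (0, 5, 245), (10, 0, 255)]

def mean_color(image):
    """Stand-in for a learned image encoder: average each color channel."""
    n = len(image)
    return [sum(pixel[channel] for pixel in image) / n for channel in range(3)]

def cosine_similarity(a, b):
    """Similarity of two vectors by angle: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical caption vectors, hand-picked for this toy example.
# A real model would learn these from millions of image-text pairs.
caption_vectors = {
    "a red square": [1.0, 0.0, 0.0],
    "a blue square": [0.0, 0.0, 1.0],
}

def best_caption(image):
    """Pick the caption whose vector points in the most similar direction."""
    image_vector = mean_color(image)
    return max(
        caption_vectors,
        key=lambda caption: cosine_similarity(image_vector, caption_vectors[caption]),
    )

print(best_caption(mostly_red_image))
print(best_caption(mostly_blue_image))
```

The point of the sketch is only that both the image and the caption end up as vectors in the same space, so "matching" becomes a question of which vectors lie closest together. Training is the process of adjusting the encoders until that is true for real images and real captions.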

The model finds ways to separate different images mathematically, and in doing so it builds a space with more than 500 dimensions to represent what it has learned. Humans cannot visualize this space, but it lets the model group similar images close together. Any point in this space is like a recipe for an image, and the text prompt helps the model find the right recipe. The last step is to turn that point into an actual image using a process called diffusion. It starts with random pixels and refines them step by step into an image that makes sense to humans. Because the process starts from random noise, the same text prompt will generally produce a different image on each run unless the random seed is fixed. If you use a different model created by someone else, you will get a different result because it has its own latent space.
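The role of randomness in that last step can be illustrated with a toy sketch. This is not a real diffusion model. A real one uses a trained neural network to predict and remove noise, while this example simply nudges random pixels toward a known target, which stands in for the "recipe" found in latent space. The function name `toy_diffusion` and all of the numbers are assumptions chosen for illustration.

```python
import random

def toy_diffusion(recipe, steps=50, seed=None):
    """Start from random pixels and refine them toward `recipe` step by step.

    `recipe` is a list of target pixel values in [0, 1]. It is a stand-in
    for the point in latent space that the text prompt selects.
    """
    rng = random.Random(seed)
    # Step 0: pure noise, one random value per pixel.
    pixels = [rng.random() for _ in recipe]
    for step in range(steps):
        # Inject a little fresh noise at each step, shrinking to zero at the end.
        noise_scale = 0.1 * (1 - (step + 1) / steps)
        pixels = [
            p + 0.2 * (target - p) + rng.gauss(0, noise_scale)
            for p, target in zip(pixels, recipe)
        ]
    return pixels

recipe = [0.9, 0.1, 0.5, 0.3]  # hypothetical 4-pixel "image recipe"

image_a = toy_diffusion(recipe, seed=1)
image_b = toy_diffusion(recipe, seed=2)
image_a_again = toy_diffusion(recipe, seed=1)

print(image_a == image_a_again)  # same seed, same starting noise
print(image_a == image_b)        # different seed, different starting noise
```

In this sketch, two runs with the same recipe land close to the same target but differ in their details, because each run starts from different noise. Fixing the seed makes the run repeatable, which is the same reason image generators that expose a seed setting can reproduce an earlier result.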

Generative AI Before 2021

AI-generated art made before 2021 was mostly abstract and hard for people to understand or relate to. With recent advances, AI-generated art has become much more versatile and can be tailored to a person’s specific needs or preferences, which allows for greater creativity and control over the final result. “Multimodal learning” is a recent development in Artificial Intelligence that has greatly improved AI-generated art. The idea is to teach AI to understand the relationship between text and images. This has resulted in models that can write captions for images and also generate images from a given caption, which makes them very useful for artistic purposes. Growing interest in AI-generated art has in turn helped speed up the development of these techniques. As models improve and training data evolves, AI-generated art keeps changing and has not reached a saturation point.

Midjourney Generative AI

The datasets used by OpenAI and Midjourney are not public, but the internet is known to be biased towards the English language and Western ideas, which leaves entire cultures underrepresented. There are also ethical concerns about AI-generated art, such as its potential for misuse or for spreading false information. AI-generated art also lacks the human touch: it may look realistic, but it does not carry the emotions or personal stories that a human artist might bring to their work. It can be repetitive or boring because it relies on the data it was trained on, which may not be updated with new information. Users also have limited control over the final product, because the AI’s output is determined by its trained weights.

Adobe, along with many other innovators, is exploring Generative AI technology, which could revolutionize the way artists approach and generate creative ideas and make creativity accessible to a wider audience. At the recent Creative Cloud keynote, Adobe showcased its latest AI tools, including photo restoration and background replacement in Photoshop, one-click color correction in Premiere Pro, and text-to-image prompts in Adobe Express. To alleviate concerns that AI might someday replace artists, Adobe’s Chief Product Officer, Scott Belsky, stated that AI should always be seen as a “co-pilot” in creative endeavors. Adobe aims to develop its AI technology with a focus on how it can benefit artists rather than replace them.

These automated image-generation tools are changing how creative work is done, because people no longer have to be experts at making images. Instead, they have to be good at thinking creatively, using language well, and choosing what to show. What makes this technology special is that it lets anyone tell the machine what they want it to imagine. This makes it easier to turn ideas into images, videos, animations, and even virtual worlds. It is a change in the way people imagine, communicate, and interact with their own culture. However, as with other automated systems trained on historical data and internet images, there may be problems that we have not yet discovered.