Food Generation From Ingredients and Cooking Method


Team name: 0 error, 0 warning

Team members: Hanyuan Xiao, Kaijie Cai, Buke Ao, Heng Zhang

Figure 1. Realistic generated food images


Food image generation, as one of the image generation tasks, is a useful visualization tool for almost everyone. Given food ingredients and cooking methods (e.g. bake, grill), people may wonder about the name and appearance of the dish that can be cooked. For example, chefs may want to try new ingredients and cooking methods to invent new menu items. Parents may worry about whether dinner will be attractive to their children while also considering nutrition. Based on the same ingredients, can we make something new and interesting? Even students facing a deadline may want to spend minimal time cooking lunch or dinner with whatever is in the fridge. Such an image generator can therefore give them a high-level idea of what they can cook.

Besides the interest this project may spark in the public, outputs of our model can also be used to evaluate and quantify vital criteria of food that have drawn attention in computational food analysis (CFA) [1], such as meal preference forecasting and computational meal preparation. The model therefore has practical use in real life. Existing approaches such as The Art of Food do not take the cooking method as input. The importance of the cooking method has been overlooked, even though the same ingredients can be made into different dishes. For instance, chicken and noodles can be made into ramen or fried noodles by boiling and stir-frying, respectively. This project therefore aims at developing a reliable method to generate food images that fit a specific class.

Problem Statement

Objects in images have many attributes that represent their visual information. These attributes can also be described by text. Hence, if the connection between images and text is learned, we can generate images with text as input. The problem can be solved in two steps.

The connection between image pixels and text descriptions is highly multimodal: there are many possible mappings between them. Multimodal learning is hard, but finding a shared representation across modalities is essential. Generalization to unseen data is also a fundamental problem.

One way to generate images from text is to encode the text into class labels. This may cause loss of information and inaccuracy, because class labels are poor representations of the original text and the diverse combinations of text can produce more classes than the model can handle. Instead of directly using class labels, [2] proposed an end-to-end architecture that generates images from text encodings using an RNN and a GAN, but the associations between text and images, as well as the loss functions, are not well established. In this project, we use two stages, an association model and a generative model, to address this problem.


We use a recipe association model, which finds a common representation (i.e. text embeddings) between images and text input, and then a GAN to generate images from the embeddings.

Cross-modal Association Model [3]

Figure 2. Association model from ingredient + method and images

The loss function of the association model is defined over a positive pair of text embedding and extracted image features, together with two negative pairs: the text embedding paired with the features of a non-matching image, and the image features paired with a non-matching text embedding. A bias (margin) term trains the model on pairs that are not yet correctly associated; it is set to 0.3 through cross-validation.

This network takes ingredients and cooking methods as input on one side and images on the other, as shown in Figure 2. The ingredients and cooking methods are encoded by LSTM and concatenated to form the representative text embedding. Image features are extracted by ResNet [4], which is then fine-tuned on our dataset and task. Finally, cosine similarity is computed between image features and text embedding. Ideally, for a positive pair of image and corresponding text embedding, the similarity approaches 1; for negative pairs, the similarity is smaller than a margin value chosen based on the task and dataset.
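The margin-based comparison of cosine similarities described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's training code: the function names, the plain-list vectors, and the exact hinge form (one term per negative pair) are assumptions; only the use of cosine similarity and the value 0.3 come from the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def margin_ranking_loss(text_emb, image_feat, neg_text_emb, neg_image_feat,
                        margin=0.3):
    """Hinge-style loss over one positive pair and two negative pairs.

    Assumed form (hypothetical, for illustration only):
      max(0, cos(text, neg_image) - cos(text, image) + margin)
    + max(0, cos(neg_text, image) - cos(text, image) + margin)
    """
    pos = cosine_similarity(text_emb, image_feat)
    neg_image = cosine_similarity(text_emb, neg_image_feat)
    neg_text = cosine_similarity(neg_text_emb, image_feat)
    return (max(0.0, neg_image - pos + margin)
            + max(0.0, neg_text - pos + margin))


# A well-associated pair with clearly different negatives gives zero loss.
well_separated = margin_ranking_loss([1.0, 0.0], [1.0, 0.0],
                                     [0.0, 1.0], [0.0, 1.0])
# A badly associated pair is penalized through both negative terms.
badly_associated = margin_ranking_loss([1.0, 0.0], [0.0, 1.0],
                                       [0.0, 1.0], [1.0, 0.0])
```

The loss is zero once every negative pair is at least `margin` less similar than the positive pair, so training effort concentrates on pairs that are still confused.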

Conditional StackGAN [5]

Figure 3. StackGAN for image generation

After extracting meaningful and representative text embeddings from ingredients and cooking methods with the trained association model, we use the text embedding of each training case as the conditional code in StackGAN. To ensure that the generated food image contains the expected ingredients and methods it is conditioned on, we added a cycle-consistency constraint [1] that keeps the similarity between generated images and their text embedding strong.

The loss function for image generation in the conditional GAN follows [1]. It uses both conditioned and unconditioned losses for the discriminator. The cycle-consistency constraint is incorporated as an additional term. The last part is a regularization term, which pushes the distribution of the conditions, given the extracted image features, as close as possible to the standard Gaussian distribution. Loss weight hyperparameters are determined by cross-validation.
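The regularization term described above is commonly implemented as a KL divergence between a diagonal Gaussian and the standard Gaussian. The sketch below shows that closed form; it assumes the conditioning distribution is parameterized by a mean vector and a log-variance vector, which the text does not state, and the function name is hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.

    Per dimension: 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1).
    `mu` and `log_var` are equal-length sequences (assumed parameterization).
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


# The penalty vanishes when the distribution already is N(0, I) ...
no_penalty = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# ... and grows as the mean drifts away from zero.
shifted_mean = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

In a full objective this value would be multiplied by one of the loss weights mentioned above and added to the adversarial and cycle-consistency terms.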



We conduct our experiments using data from Recipe1M [6]. The Recipe1M dataset consists of more than 1 million food images with corresponding ingredients and instructions. We manually selected 12 types of cooking methods that we believe are meaningful and statistically distinguishable, and then labeled each training sample with its cooking methods by searching for keywords in the instruction text. We also reduced the number of distinct ingredients from around 18,000 to around 2,000 by removing low-frequency ingredients (fewer than 500 occurrences in the dataset) and then merging ingredients that are of the same kind contextually (e.g. different kinds of oil, which look the same in images) or trivially (e.g. 1% milk and 2% milk). Because of limited time and computing resources, we used only 10,000 samples from the dataset for training.
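The preprocessing described above (keyword search for cooking methods, frequency filtering, then merging of equivalent ingredients) could look roughly like the following. This is a sketch under assumptions: the keyword list, the alias table, and the function names are hypothetical, since the 12 methods and the merge rules are not enumerated here, and the simple substring match will miss irregular forms such as "fried".

```python
from collections import Counter

# Hypothetical keyword list; the actual 12 cooking methods are not listed.
METHOD_KEYWORDS = ["bake", "boil", "fry", "grill", "stir"]


def extract_methods(instructions, keywords=METHOD_KEYWORDS):
    """Return the cooking-method keywords that occur in the instruction text."""
    text = instructions.lower()
    return [k for k in keywords if k in text]


def filter_ingredients(recipes, min_count=500, aliases=None):
    """Drop rare ingredients, then merge equivalent ones via an alias table.

    `recipes` is a list of ingredient lists. `aliases` maps a raw ingredient
    name to its canonical name (hypothetical table, e.g. "1% milk" -> "milk").
    """
    aliases = aliases or {}
    counts = Counter(ing for recipe in recipes for ing in recipe)
    kept = {ing for ing, count in counts.items() if count >= min_count}
    cleaned = []
    for recipe in recipes:
        merged = []
        for ing in recipe:
            if ing not in kept:
                continue
            name = aliases.get(ing, ing)
            if name not in merged:  # avoid duplicates created by merging
                merged.append(name)
        cleaned.append(merged)
    return cleaned


toy_recipes = [["1% milk", "egg"], ["2% milk", "egg"], ["saffron", "egg"]]
methods = extract_methods("Stir the onions, then bake for 20 minutes.")
frequent_only = filter_ingredients(toy_recipes, min_count=2)
merged_milk = filter_ingredients(
    toy_recipes, min_count=1,
    aliases={"1% milk": "milk", "2% milk": "milk"})
```

The toy thresholds here are tiny so the example runs on three recipes; the default `min_count=500` mirrors the cutoff stated in the text.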


We feed the association model with paired and unpaired 128 × 128 images and text input. For the StackGAN model, we feed text embeddings as conditions, together with random noise, to the generator. To the discriminator, we feed both 64 × 64 and 128 × 128 images from our dataset and from the generator. The real images can be paired with their corresponding text or with random text.


We evaluated our task and approach with qualitative and quantitative results. In the qualitative part, we demonstrate that our results are valid and meaningful under different conditions. In the quantitative part, we show two tables that compare the performance of our model with prior work.


Besides Figure 1, where we show several realistic images generated by our model, here we compare the influence of the two inputs, ingredients and cooking method, on image generation.

Figure 4. Fixed ingredients (pork chops, green pepper and butter) and change cooking method

In Figure 4, the ingredients are fixed as pork chops, green pepper and butter, while the cooking method is changed from stir+fry to boil.

Figure 5. Fixed ingredients (cheese, egg and pizza sauce) and change cooking method

In Figure 5, the ingredients are fixed as cheese, egg and pizza sauce, while the cooking method is changed from boil+heat to bake+stir.

Figure 6. Fixed cooking method and add blueberry

In Figure 6, the cooking method is fixed as bake, as for a muffin, and blueberry is added as an extra ingredient. Blueberries appear on top of and inside the muffin, and we can see the dips they leave in the muffin.

Figure 7. Fixed cooking method and add chocolate

In Figure 7, the cooking method is fixed as bake, as for a muffin, and chocolate is added as an extra ingredient. Chocolate is mixed with flour to prepare the muffin base, and the generated muffin has a darker color that reflects the chocolate.

Figure 8. Generated images of pork with different noise

In Figure 8, we show generated images of pork with different noise input.

Figure 9. Generated images of pork with different cooking methods

In Figure 9, we show generated images of pork with different cooking methods.


To evaluate the association model, we adopt median retrieval rank (MedR) and recall at top K (R@K) as in [1]. In a subset of recipe-image pairs randomly selected from the test set, every recipe is used as a query to retrieve its corresponding image by ranking cosine similarities in the common space; this is recipe2im retrieval. MedR is the median rank position of the correct image, while R@K measures the percentage of queries for which the true image ranks in the top K. Therefore, a lower MedR and a higher R@K imply better performance. To evaluate the stability of retrieval, we set the subset size to 1K, 5K, and 10K. We repeat the experiments 10 times for each subset size and report the mean results. Im2recipe retrieval is evaluated likewise. Table 1 shows these quantities. Our model performs better on all scores, which indicates that canonical, clear ingredients and the addition of cooking method as input are important to the task.
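The two retrieval metrics above can be computed from a similarity matrix as in the following sketch. It assumes a square matrix in which entry `[i][j]` is the cosine similarity between recipe `i` and image `j`, with image `i` being the true match for recipe `i`; the function names and the tie-handling rule are illustrative choices, not taken from the evaluation code.

```python
import statistics


def retrieval_ranks(similarity):
    """Rank of the true image for every recipe query (1 is best).

    similarity[i][j] is the similarity between recipe i and image j;
    image i is assumed to be the correct match for recipe i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank = 1 + number of images scored strictly higher than the true one.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    return ranks


def median_rank(ranks):
    """MedR: median rank position of the correct image (lower is better)."""
    return statistics.median(ranks)


def recall_at_k(ranks, k):
    """R@K: percentage of queries whose true image is in the top K."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


toy_similarity = [
    [0.9, 0.1, 0.2],  # recipe 0: true image ranked first
    [0.8, 0.3, 0.1],  # recipe 1: one wrong image scores higher
    [0.2, 0.7, 0.6],  # recipe 2: one wrong image scores higher
]
ranks = retrieval_ranks(toy_similarity)
med_r = median_rank(ranks)
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
```

Im2recipe retrieval would use the same functions on the transposed matrix, so that each image acts as the query.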

Table 1. Quantitative Evaluation for Cross-modal Association Model

We used inception score (IS) and Fréchet Inception Distance (FID) to evaluate the GAN results. IS is computed on a batch of images, while FID compares the real image set with the generated image set. The higher the IS and the lower the FID, the better the quality and diversity of the generated images. In Table 2, the comparison uses the same model structure, parameters, and training and test cases, with approximately the same IS for the real image sets. The only difference is the input type. The image-input model has only noise as input to the generator. The ingredient-input model has noise and the ingredient text embedding as input to the generator. The ingredient+method model has noise, the ingredient text embedding and the cooking method text embedding as input.
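For readers unfamiliar with IS, the sketch below computes it from class-probability vectors using the standard definition: the exponential of the mean KL divergence between each image's predicted class distribution and the marginal distribution over the batch. It assumes the probabilities have already been produced by a pretrained classifier; that step, and the function name, are outside what the text specifies.

```python
import math


def inception_score(probs):
    """Inception score from per-image class probabilities.

    `probs` is a list of probability vectors p(y|x), one per image.
    IS = exp( mean over images of KL( p(y|x) || p(y) ) ),
    where p(y) is the average of p(y|x) over the batch.
    """
    n = len(probs)
    num_classes = len(probs[0])
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    total_kl = 0.0
    for p in probs:
        # Terms with zero probability contribute nothing to the KL sum.
        total_kl += sum(pc * math.log(pc / marginal[c])
                        for c, pc in enumerate(p) if pc > 0)
    return math.exp(total_kl / n)


# Confident and diverse predictions give the maximum score for two classes.
confident_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
# Uninformative predictions give the minimum score of 1.
uninformative = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

FID, by contrast, compares the mean and covariance of feature vectors from the real and generated sets, so it needs both image sets rather than a single batch.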

Table 2. Quantitative Evaluation for GAN

Table 2 indicates that cooking method, as an extra input, is useful and valuable for the food image generation task.

Future Improvements

From the experiments, we find several improvements that can be made in the future.

Figure 10. A batch of generated images

For reference, we include the loss curves to compare different inputs. We welcome any insightful suggestions on improving performance. Figure 11 shows the loss curves of all models over 150 epochs of training. Figure 12 shows the loss curve of the ingredient+method model over the 520 epochs that we trained in total.

Figure 11. Loss curves of models with different inputs

Figure 12. Loss curve of model with ingredient+method as input in 520 epochs


We acknowledge the assistance and advice of Professor Joseph Lim and the wonderful TAs of the course CS-566 (Deep Learning and its Applications). With their guidance, we developed the project and made the following contributions.


