Nutritional Content Detection Using Vision Transformers: An Intelligent Approach
Keywords:
Machine Learning, Vision Transformer (ViT), Convolutional Neural Networks, Food, Nutrition.
Abstract
The nutritional composition of food facilitates energy production, growth, and overall health while also preventing disease and enhancing immunity. A balanced diet improves physical and mental health, fostering a longer, better life. Precise assessment of nutritional value from food photographs is therefore crucial for dietary monitoring, individualized nutrition, and health management. Conventional methods employing convolutional neural networks struggle to generalize across many food varieties, intricate presentations, and overlapping items. Vision Transformers offer a formidable alternative owing to their self-attention mechanism and capacity to model global dependencies. This research introduces a pipeline utilizing Vision Transformers to estimate calories, macronutrients such as protein and fat, and micronutrients directly from food photographs. The model uses pre-trained Vision Transformers fine-tuned on diverse food datasets, and incorporates supplementary input, such as recipe details, via multimodal fusion.