Object Counting with Visual AI has wide-ranging applications across numerous fields. In retail, automated shelf auditing systems count products to maintain optimal stock levels. In manufacturing, counting components on assembly lines ensures quality control and detects production anomalies. Medical researchers use cell counting in microscopic images for disease diagnosis and drug development. In safety and security, counting people in crowded spaces helps manage occupancy levels and ensures compliance with safety regulations.
Traditionally, object counting with computer vision has been a complex process requiring significant expertise. Developers typically collect and label extensive datasets, select and train machine learning models (often CNNs), and implement post-processing logic. This approach involves data preprocessing, model architecture design, validation, and deployment coding. The multi-step workflow makes implementing such a solution time-consuming and resource-intensive for many organizations.
VisionAgent, the generative Visual AI application builder from LandingAI, makes tasks such as counting common objects straightforward by leveraging state-of-the-art vision language models (VLMs) and combining them with an agentic framework that generates custom code for your use case. This approach kickstarts your project in minutes, rather than benchmarking different models for your use case and investing significant time and resources in adapting new models to your ecosystem.
Let’s go through a practical application: inventory management for soda cans. Imagine you’re tasked with creating a system to monitor soda can inventory in a storage facility. The goal is to automatically calculate the inventory percentage based on a maximum capacity of 35 cans and determine whether restocking is necessary.
Prompt VisionAgent with Clear Project Requirements
When approaching this Visual AI use case, it’s crucial to formulate clear, detailed instructions — just as you would for a junior software engineer. For our soda can inventory system, we want the AI model to identify and count the cans, calculate the inventory percentage based on our maximum capacity of 35 cans, and provide a status indicating whether the inventory is healthy or needs restocking.
This level of specificity helps VisionAgent generate more accurate and useful code, ensuring the output meets our exact needs.
We will start with a simple but precise prompt to prototype the soda can use case with VisionAgent:
“Write a program that counts soda cans in an image. The program should output the count, draw bounding boxes around each detected can, and display the confidence score for each prediction. Use a threshold of 0.2”
Recap these steps in the following video:
Prototype a Solution
After we’ve prompted VisionAgent, it generates code that addresses our use case. This generated code serves as a starting point, which we can run, evaluate, and iteratively improve. During this prototyping phase, we might experiment with different phrasings in our prompt, adjust the specificity of our instructions, or fine-tune the output requirements.
The goal is to rapidly iterate until VisionAgent generates code that accurately detects and counts the soda cans, calculates inventory levels, and provides the status we need. This prototyping process not only helps us refine our solution but also allows us to understand the capabilities and limitations of VisionAgent for our specific use case.
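The exact code varies from run to run, but the generated starting point often resembles the sketch below. Note that this is an illustration, not VisionAgent’s actual output: the detector parameter is a placeholder for whichever detection tool the agent selects, and each detection is assumed to be a dict with a label, a confidence score, and a pixel-space bounding box.

```python
from typing import Callable, Dict, List

import cv2
import numpy as np

# Each detection is assumed to look like:
#   {"label": str, "score": float, "bbox": [x1, y1, x2, y2]}  (pixel coordinates)
Detection = Dict[str, object]


def count_soda_cans(
    image: np.ndarray,
    detector: Callable[[str, np.ndarray], List[Detection]],
    threshold: float = 0.2,
) -> int:
    """Count soda cans, draw a box per detection, and show each confidence score."""
    detections = [d for d in detector("soda can", image) if d["score"] >= threshold]

    for det in detections:
        x1, y1, x2, y2 = (int(v) for v in det["bbox"])
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, f"{det['score']:.2f}", (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

    print(f"Count: {len(detections)}")
    return len(detections)
```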
Looking at the result for our use case, VisionAgent is able to identify all the soda cans but counts one of them twice (can you spot this in the screenshot?). Since this incorrect prediction has the lowest confidence score, we will raise the confidence threshold to 0.5.
The program also prints the confidence threshold along with the count, which we don’t need. We can edit the code in VisionAgent to match the requirements, or prompt VisionAgent to remove the confidence score. VisionAgent remembers your updates for future conversations.
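In code terms, both fixes are small; assuming the sketch above, they amount to a stricter threshold and a count-only printout:

```python
# Stricter threshold: the low-confidence duplicate detection now drops out.
detections = [d for d in detector("soda can", image) if d["score"] >= 0.5]

# Print only the count, without the confidence information.
print(len(detections))
```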
Recap these steps in the following video:
Validate, Iterate and Deploy
Once our initial code is ready, we can test it with other images to validate that it works consistently for our use case. We can upload the images directly to VisionAgent and run the code against them; these images are saved for future use.
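Outside the VisionAgent UI, the same consistency check can be scripted by running the function over a folder of test images. The folder name and detector handle below are placeholders, reusing the earlier sketch:

```python
from pathlib import Path

import cv2

# Run the co-developed function over every test image and print the counts.
for path in sorted(Path("test_images").glob("*.jpg")):
    image = cv2.imread(str(path))
    count = count_soda_cans(image, detector, threshold=0.5)
    print(f"{path.name}: {count} cans")
```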
Recap these steps in the following video:
VisionAgent also lets you create a Streamlit app to share your prototype with your team for further validation.
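A minimal version of such a Streamlit app, assuming the count_soda_cans function and detector handle from the earlier sketch, might look like this:

```python
import cv2
import numpy as np
import streamlit as st

st.title("Soda Can Counter")

uploaded = st.file_uploader("Upload a shelf image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    # Decode the uploaded bytes into an OpenCV image.
    image = cv2.imdecode(np.frombuffer(uploaded.read(), np.uint8), cv2.IMREAD_COLOR)
    count = count_soda_cans(image, detector, threshold=0.5)  # from the earlier sketch
    st.image(cv2.cvtColor(image, cv2.COLOR_BGR2RGB), caption=f"Detected {count} cans")
```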
Recap these steps in the following video:
Once the code is validated for detecting soda cans, we prompt the agent with the rest of the requirements: “Additionally, calculate the inventory percentage based on a maximum capacity of 35 cans. If the inventory is below 50%, the program should output a status of ‘Needs Restocking.’ If the inventory is above 50%, it should output a status of ‘Healthy.’”
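Since this new requirement is plain arithmetic on top of the count, the added logic is small. A sketch of what it could look like (the inventory_status name is illustrative):

```python
MAX_CAPACITY = 35  # maximum number of cans the storage area holds


def inventory_status(can_count: int, max_capacity: int = MAX_CAPACITY) -> tuple[float, str]:
    """Return the inventory percentage and a restocking status."""
    percentage = (can_count / max_capacity) * 100
    # The prompt leaves exactly 50% unspecified; here it falls under "Needs Restocking".
    status = "Healthy" if percentage > 50 else "Needs Restocking"
    return percentage, status
```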
Once all requirements are met, we can copy or download the code co-developed with VisionAgent and run inference by invoking the count_soda_cans function in the desired environment.
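As a quick illustration, the final functions could be called from a plain Python script; the image path and detector handle below are placeholders:

```python
import cv2

image = cv2.imread("storage_shelf.jpg")  # placeholder path
count = count_soda_cans(image, detector, threshold=0.5)
percentage, status = inventory_status(count)
print(f"{count} cans, {percentage:.0f}% of capacity -> {status}")
```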
The VisionAgent team is also actively developing a web deployment endpoint to simplify the deployment process.
More on CountGD
Our internal benchmarks revealed that CountGD, the multi-modal counting model powering VisionAgent, consistently delivers results with less than a 5% margin of error. This level of accuracy is maintained even in complex scenarios involving over 100 objects of interest within a single image.
CountGD also works well with overlapping objects, which are common in real-world use cases. However, it is color agnostic, so it cannot distinguish objects by color, and its precision drops when multiple classes must be detected in a single pass. To mitigate the multi-class limitation, the function can be executed once per class, or the image can be pre-processed to segregate the classes, which improves accuracy.
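As an illustration of the first workaround, the counting call can simply be repeated once per class; the class names and detector handle below are placeholders, reusing the detection format assumed earlier:

```python
# One pass per class: counting is more reliable when each call targets a
# single object category.
counts = {
    label: len([d for d in detector(label, image) if d["score"] >= 0.5])
    for label in ["soda can", "water bottle"]  # placeholder classes
}
print(counts)
```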
Conclusion
Within minutes, we moved through prototyping, iterating on, and deploying code for an object counting use case that was easy to articulate in text. Previously, building a similar solution would have involved an object detection model coupled with extensive post-processing, requiring significant effort in labeling, training, and integration. In situations where higher accuracy is needed than current models can provide, we can use the developed code for zero-shot labeling as a foundation for training more sophisticated deep-learning models.
With new releases of multi-modal models, many Visual AI tasks that were previously cumbersome can now be iterated on quickly. Aligned with our mission to simplify computer vision development, we’re democratizing cutting-edge Visual AI and Generative AI technologies, making it easier to build Visual AI solutions. Get early access to VisionAgent today at va.landing.ai/agent.