The distinction between how machines and humans perceive and identify objects may not be readily apparent. As humans, we possess the remarkable ability to instantaneously recognize objects, often including those we have never encountered before, by drawing upon our memories and collective experiences. Through unconscious cognitive processes, we effortlessly establish intricate connections between new objects and our existing knowledge, enabling us to approximate their purpose and, at times, even accurately guess their names. Despite rapid advances in GPUs, algorithms, and the availability of vast datasets, machines have not yet attained this level of cognitive flexibility. The challenge lies in their limited capacity to think creatively and venture beyond predefined patterns. This article aims to provide a fundamental understanding of how object detection works in machine learning. It eschews complex neural network computations, sophisticated terminology, and convoluted formulas, making it accessible to readers seeking a clear and concise overview of the topic.
Understanding Computer Interpretation of Images and Object Detection in Machine Learning
To get us all on the same page, I would like to start by explaining how computers interpret an image. Our human eyes are connected directly to the brain, where light information is processed and, bang, we see. Computers do it differently. First, the camera captures an image as a set of pixels with different color values. This information can be converted into binary form (zeroes and ones), and those values are what machine learning models actually process. This concept is important to understand, because every model learns patterns from groups of pixel values that sit close to each other; it does not interpret an image as a whole.
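To make this concrete, here is a minimal sketch in Python of what an image looks like to a machine: nothing but a grid of numbers. (The choice of NumPy is my assumption, since the post names no specific tools.)

```python
import numpy as np

# A tiny 2x2 RGB "image": each pixel is three numbers (red, green, blue), 0-255.
image = np.array([
    [[255,   0,   0], [  0, 255,   0]],   # a red pixel, a green pixel
    [[  0,   0, 255], [255, 255, 255]],   # a blue pixel, a white pixel
], dtype=np.uint8)

print(image.shape)   # (2, 2, 3): height, width, color channels
print(image[0, 0])   # [255   0   0] -- the raw numbers behind "red"

# The same value in the binary form the hardware ultimately stores:
print(format(int(image[0, 0, 0]), '08b'))   # '11111111' -- 255 as zeroes and ones
```

Once you understand that a model only ever sees these numbers, I can introduce you to the definition of object detection in the machine learning world. Object detection consists of two parts: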
- Object localization — identifying the location of an object in a given image. Usually this means returning the object’s center point and the dimensions of the bounding box that contains it.
- Object classification — this is where the model names the objects: cars 🚗 , apples 🍎 , flowers 🌷. It is important to note that it can only assign a category label from a fixed set; it cannot create a new category. A sketch of what a detector’s output looks like follows this list.
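The two parts map directly onto the shape of a detector’s output. The sketch below is plain Python with invented numbers, not the output of any particular library: one bounding box (localization) plus one label picked from a fixed list (classification).

```python
# Hypothetical output of a single-object detector for one photo.
# Every value here is made up for illustration.
CLASSES = ["car", "apple", "flower"]  # the fixed set the model was taught

detection = {
    "cx": 320, "cy": 240,        # localization: center point of the box, in pixels
    "width": 100, "height": 80,  # ...and the box dimensions
    "class_id": 0,               # classification: an index into CLASSES
    "confidence": 0.92,          # how sure the model is
}

label = CLASSES[detection["class_id"]]
# Convert center + size into the top-left corner most drawing tools expect:
x1 = detection["cx"] - detection["width"] // 2
y1 = detection["cy"] - detection["height"] // 2
print(f"{label} ({detection['confidence']:.0%}) at ({x1}, {y1}), "
      f"{detection['width']}x{detection['height']} px")
# -> car (92%) at (270, 200), 100x80 px
```

Note that the model never produces the string "car" itself; it only produces an index into its fixed list, which is exactly why it can never name a category it was not taught.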
If the model was taught to detect only one object, the terms object detection (which is more popular) and object localization are used interchangeably. Object localization is by far the simplest way a computer can see: it recognizes a set of pixel patterns and roughly estimates the bounding box of the object. But what if we wanted more precision? What if we want to know exactly where the road ends and the pavement begins? This important information cannot be extracted from object detection models. Thankfully, there are two well-documented methods that can get us this valuable information:
- Semantic segmentation — this method retrieves the specific boundaries of each object by classifying every pixel in an image into an explicitly taught category. But what if there were several almost identical objects? A semantic segmentation model wouldn’t be able to tell the difference between them. In that case, thankfully, there is:
- Instance segmentation — a subclass of semantic segmentation. It expands on it by recognizing different instances (hence the name) of similar objects. A toy example contrasting the two methods follows this list.
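To see the difference, here is a toy example (NumPy again, with invented values). In a semantic mask every pixel gets a category ID, so two cars of the same category are indistinguishable; an instance mask additionally gives each car its own ID.

```python
import numpy as np

# Toy 4x6 pixel masks; 0 = background, other values = "car". All values invented.
semantic = np.array([
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])  # both cars carry the same category ID 1 -- they cannot be told apart

instance = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])  # same category, but car #1 and car #2 get distinct instance IDs

print("objects visible in semantic mask:", len(np.unique(semantic)) - 1)  # 1
print("objects visible in instance mask:", len(np.unique(instance)) - 1)  # 2
```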
Knowledge of computer vision terminology is important when looking for a solution to a specific problem. Now you know why you wouldn’t want to ride in a self-driving car that relies on object detection alone. You can also look at the first image of this blog post and instantly recognize which techniques were used to generate the coloring. In the next post I will walk you through the steps to build your own working machine learning object detection model in less than an hour. Interested? Stay tuned!