My robot needs a computer vision system that can run on a single board computer. Many image processing and computer vision libraries make design tradeoffs favoring broad algorithm coverage, accuracy and extensibility rather than performance on a small platform. While this is good for applications with workstation class computers, it is bad for small robots and other deeply embedded systems.
EmbedCV does not try to do everything. It is only focused on embedded computer vision practical for low-end computing platforms (e.g. a 200 MHz ARM or Pentium single board computer). So EmbedCV trades accuracy for speed and has a very limited selection of algorithms. It is not a general purpose image processing library.
It is still very early in development. EmbedCV is not mature. The library version number reflects this. While many core algorithms are implemented, many more remain to be written. For instance, notably absent is any support for the Hough transform. Movie formats are completely unsupported at this time. Image format support is limited to 24 bit binary format RGB PPM files and decompressing JPEGs using the IJG libjpeg library. This is pretty much the bare minimum.
There really are two libraries: libecvcore and libecvaux. All third party library dependent code (such as JPEG decompression using libjpeg) is placed in libecvaux. So libecvcore has no dependencies outside of the standard C library that is a part of a normal toolchain. Development emphasis at this time is on core algorithms rather than supporting code such as media formats.
Statistics support is primitive. Besides efficient calculation of mean and variance, there is Otsu's thresholding method. Clustering and analysis of histograms is needed.
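As a sketch of what Otsu's method actually does (this is not EmbedCV's API, and it uses floating point for clarity even though the library itself avoids it): given a 256-bin grayscale histogram, choose the threshold that maximizes the between-class variance.

```c
/* Hypothetical sketch of Otsu's thresholding method (not the EmbedCV
 * API).  Returns the threshold in [0, 255] that best separates the
 * histogram into a dark class and a bright class. */
static int otsu_threshold(const unsigned hist[256])
{
    double total = 0.0, sumAll = 0.0;
    for (int i = 0; i < 256; ++i) {
        total  += hist[i];
        sumAll += (double)i * hist[i];
    }

    double wBack = 0.0, sumBack = 0.0, best = -1.0;
    int bestT = 0;
    for (int t = 0; t < 256; ++t) {
        wBack += hist[t];                 /* background weight */
        if (wBack == 0.0) continue;
        double wFore = total - wBack;     /* foreground weight */
        if (wFore == 0.0) break;
        sumBack += (double)t * hist[t];

        double meanBack = sumBack / wBack;
        double meanFore = (sumAll - sumBack) / wFore;
        /* between-class variance: wB * wF * (meanB - meanF)^2 */
        double between = wBack * wFore
                       * (meanBack - meanFore) * (meanBack - meanFore);
        if (between > best) { best = between; bestT = t; }
    }
    return bestT;
}
```

An integer-only version is possible (and more in the spirit of the library), at the cost of some care with overflow.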
Feature detection using the integral image transform is supported. Objects can be detected (sort of). This can be used for obstacle avoidance and tracking. But really a much deeper infrastructure is required (AdaBoost learning algorithm to generate a feature cascade). Still, it does work despite being very primitive. This dispels the feeling that object detection must be magic.
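The integral image itself is simple to compute. A sketch with hypothetical function names (not the library API): each entry holds the sum of all pixels above and to the left of it, inclusive, so the sum over any rectangle of pixels costs only four lookups.

```c
#include <stdint.h>
#include <stddef.h>

/* Build an integral image: ii[y*width + x] is the sum of all pixels
 * in the rectangle from (0,0) to (x,y) inclusive. */
static void integral_image(const uint8_t *img, uint32_t *ii,
                           size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        uint32_t rowSum = 0;
        for (size_t x = 0; x < width; ++x) {
            rowSum += img[y * width + x];
            ii[y * width + x] = rowSum + (y ? ii[(y - 1) * width + x] : 0);
        }
    }
}

/* Sum of the rectangle with corners (x0,y0) .. (x1,y1) inclusive,
 * in O(1) regardless of the rectangle size. */
static uint32_t rect_sum(const uint32_t *ii, size_t width,
                         size_t x0, size_t y0, size_t x1, size_t y1)
{
    uint32_t a = (x0 && y0) ? ii[(y0 - 1) * width + (x0 - 1)] : 0;
    uint32_t b = y0 ? ii[(y0 - 1) * width + x1] : 0;
    uint32_t c = x0 ? ii[y1 * width + (x0 - 1)] : 0;
    return ii[y1 * width + x1] - b - c + a;
}
```

This constant-time rectangle sum is what makes Haar-like feature evaluation cheap enough for detection cascades.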
Almost certainly there are bugs. In writing and experimenting with the sample code, I think the most obvious ones have been fixed. Keep in mind that this library values speed over accuracy. For example, the edges of images are not always handled correctly. Morphological operations can never change the pixels on the image frame edges. This is a side effect of the bit shifting logic. Convolutions like blurring and Sobel edge detection are also inaccurate along the image frame edges. While the image frame edges could be handled as special cases, this should not be too important for a computer vision application. Accuracy is traded for speed (and to some extent, simplicity).
EmbedCV uses a lot of pointer arithmetic and bit shifting. I have read that "Pointers are poison to optimizing compilers" (Efficient Memory Programming by David Loshin). Macros are used to explicitly force loop unrolling in many places. The basic assumptions are:
One wish for future development is optimized MMX code (ARM would be nice too) to use the packed register instruction set. EmbedCV is really designed for this as it does not use floating point anywhere. Speed would increase several-fold.
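The macro-driven unrolling mentioned above can be illustrated like this (a hypothetical example, not EmbedCV's actual macros): four pixels are processed per loop iteration using only pointer arithmetic.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical illustration of forcing loop unrolling with the
 * preprocessor.  The statement is expanded four times per iteration,
 * cutting the loop overhead by a factor of four. */
#define UNROLL4(STMT) do { STMT; STMT; STMT; STMT; } while (0)

/* Simple per-pixel threshold; count must be a multiple of 4. */
static void threshold_image(const uint8_t *src, uint8_t *dst,
                            size_t count, uint8_t cutoff)
{
    const uint8_t *end = src + count;
    while (src != end) {
        UNROLL4(*dst++ = (*src++ >= cutoff) ? 255 : 0);
    }
}
```

Whether this beats what the compiler does on its own is platform-dependent, but on older embedded toolchains explicit unrolling of inner pixel loops often pays off.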
Image processing theory can be very confusing without some basic intuition. I didn't know anything about image or signal processing beforehand. (full disclosure - I was an applied mathematics major in school) Developing this library is part of the learning process for me. Here are a few of the fundamental concepts explained in a non-mathematical way. I find that people who have very deep experience in a subject area are usually able to present it in terms of common sense. Hopefully, I know enough to convey some basic ideas about image processing and computer vision (although I lack deep experience).
This is actually a deep question. We usually think intuitively of an image as a photograph or a single frame from a movie. In this case, an image is a flat, two dimensional grid with a pixel of light at each point. Sometimes, an image is represented digitally in a computer in exactly this way. It is stored as an array of bytes, each one corresponding to a pixel in the image. However, alternative representations of an image are often far more useful for digital computers.
One major reason for representing images in other ways is the need to make them smaller. An array of bytes corresponding directly to pixel values is very large and requires lots of memory. So image representations are usually compressed for storage and transmission. Probably the most common method is JPEG. It uses an image transform based on spatial frequency, the discrete cosine transform (the inverse transform is applied when decompressing). The compression is possible largely because high frequency components in the transformed image are discarded. High frequency tends to mean small features in an image. Human perception doesn't notice this too much. The mildly altered decompressed image still appears normal. This approach to compression is very common. Another example, not involving images, is the telephone system. Human speech is dominated by lower frequencies. The phone system discards higher frequencies to reduce the bandwidth required for a telephone call. Speech is still intelligible and not altered too much so we can understand it.
Computer vision represents images in other ways for a different reason. Very often, a transformed image representation renders some useful aspect or quality more obvious or easier to compute. Images typically are very large and contain an overwhelming amount of data. Computers understand images by selectively discarding extraneous information until all that is left is the pertinent stuff. Image transforms make this possible. There is an aphorism that Artificial Intelligence is really about search problems. In the same vein, I think of computer vision as searching through the "image information space". Out of an enormous amount of data comes a small and tangible result with clear meaning. This is the goal of computer vision.
Many small robots that can see use a technique commonly referred to as "color blob detection". If there is a red ball in a blue room, then filtering out all blue pixels leaves only the red ball. We could do this by holding up a red tinted filter and looking through it. All of the blue light would be blocked by the filter. We would only see the red light from the ball. This sort of technique is sometimes used for seeing golf balls. Tinted sunglasses can absorb more green light from grass, making a white golf ball stand out more.
Color blob detection is an example of image segmentation. Typically, a threshold is applied to each pixel in the image. If the pixel value is sufficiently bright, dark, or close enough to a certain color, then the pixel is marked as what we are interested in. Hopefully, the marked pixels represent some object or quality in the image that has meaning. Real world image segmentation (e.g. radiological imaging of body organs and structures) becomes quite complicated as the criteria for marking a pixel as a part of the segment of interest often involves neighboring pixels and other constraints. The complexity is necessary for good results.
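A naive color blob segmentation of the red-ball-in-a-blue-room kind might look like this (illustrative only; the function name and parameters are made up, not the library API). Every pixel of a packed 24-bit RGB image whose color lies within a per-channel tolerance of a reference color is marked in an output mask.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Mark pixels close to a reference color.  rgb holds 3 bytes per
 * pixel (R, G, B); mask gets 255 where the pixel matches, 0 elsewhere. */
static void segment_color(const uint8_t *rgb, uint8_t *mask,
                          size_t pixels,
                          uint8_t refR, uint8_t refG, uint8_t refB,
                          uint8_t tol)
{
    for (size_t i = 0; i < pixels; ++i) {
        const uint8_t *p = rgb + 3 * i;
        mask[i] = (abs(p[0] - refR) <= tol &&
                   abs(p[1] - refG) <= tol &&
                   abs(p[2] - refB) <= tol) ? 255 : 0;
    }
}
```

Real segmentation criteria are usually more involved (neighborhood constraints, better color spaces than raw RGB), as the text notes.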
A convolution takes a pattern (the convolution kernel or mask) and slides it over an image in every possible position (sometimes orientation and size too) to find a match. What results is a new image that shows how well the pattern matches in each position. It is the same thing as trying to assemble a single piece into a jigsaw puzzle by trying all possible ways it might fit. Just like solving a jigsaw puzzle, convolution takes time to compute. But it is a core concept in computer vision (and image processing as well). Many algorithms and approaches are at some level using convolution.
Sobel edge detection is an example of convolution: the sliding pattern tries to match small regions of the image that have widely different pixel values. This normally indicates an edge. Here is a Sobel kernel for detecting edges from left to right in an image.
    -1  0  +1
    -2  0  +2
    -1  0  +1
Just imagine an image of a tall pole against a bright sky. What happens when the Sobel kernel above slides horizontally across the image?
When the kernel is over sky, the pixel values are very nearly the same. So the left hand negative numbers in the mask (-1, -2, -1) are nearly canceled out by the right hand side positive numbers (+1, +2, +1) when multiplied by the corresponding pixel values from the image. The result is very small values over clear sky. This is what we expect as the sky does not have edges in it.
The situation changes as the kernel mask slides over the pole. The pixels on the pole will be darker or at least a different value from the sky pixels. So with one side of the mask on sky and the other on the pole, the left and right hand sides do not cancel out as in the pure sky case. The sum of the left and right sides will be either a large negative or positive number, indicating a detected edge.
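Putting the kernel above into code, the convolution might be sketched like this (not the EmbedCV implementation; note that the one-pixel frame border is simply skipped, matching the accuracy caveat mentioned earlier).

```c
#include <stdint.h>
#include <stddef.h>

/* Convolve a grayscale image with the horizontal Sobel kernel
 *     -1  0  +1
 *     -2  0  +2
 *     -1  0  +1
 * The one-pixel border of the frame is left untouched in dst. */
static void sobel_horizontal(const uint8_t *src, int16_t *dst,
                             size_t width, size_t height)
{
    const ptrdiff_t w = (ptrdiff_t)width;   /* signed row stride */
    for (size_t y = 1; y + 1 < height; ++y) {
        for (size_t x = 1; x + 1 < width; ++x) {
            const uint8_t *p = src + y * width + x;
            int sum = -    p[-w - 1] +     p[-w + 1]
                      - 2 * p[-1]    + 2 * p[1]
                      -    p[w - 1]  +     p[w + 1];
            dst[y * width + x] = (int16_t)sum;
        }
    }
}
```

Over flat sky the positive and negative taps cancel to near zero; straddling the pole they add up to a large magnitude, exactly as described above.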
Another example of convolution is blurring an image. In this case, the pattern (convolution kernel) is looking for the best average pixel value in a small neighborhood. The resulting convolved image appears as blurred due to the pixel averaging. One form of blurring is so important that it deserves special mention. Gaussian convolution is a way of blurring an image uniformly in all directions (you may see the term isotropic) without distorting it unnecessarily. Useful information is preserved while high frequency noise and detail is discarded. Gaussian convolution is in many ways the optimal way to blur an image.
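A common small-kernel integer approximation to Gaussian blurring is the 3x3 binomial kernel, whose weights sum to 16 so normalization is just a right shift (again an illustrative sketch, not the library's code).

```c
#include <stdint.h>
#include <stddef.h>

/* 3x3 binomial blur, a small integer approximation to a Gaussian:
 *     1 2 1
 *     2 4 2      (weights sum to 16, so normalize with >> 4)
 *     1 2 1
 * The one-pixel frame border is left untouched. */
static void blur3x3(const uint8_t *src, uint8_t *dst,
                    size_t width, size_t height)
{
    const ptrdiff_t w = (ptrdiff_t)width;
    for (size_t y = 1; y + 1 < height; ++y) {
        for (size_t x = 1; x + 1 < width; ++x) {
            const uint8_t *p = src + y * width + x;
            unsigned sum =      p[-w - 1] + 2u * p[-w] +      p[-w + 1]
                         + 2u * p[-1]     + 4u * p[0]  + 2u * p[1]
                         +      p[w - 1]  + 2u * p[w]  +      p[w + 1];
            dst[y * width + x] = (uint8_t)(sum >> 4);
        }
    }
}
```

Repeated applications of this kernel converge toward a true Gaussian blur, which is one reason the binomial weights are so common in integer-only code.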
Convolution is a very powerful and useful technique. Many operations that we would like to perform across an image can be represented as a convolution kernel. The kernel pattern is really a finite approximation to the operation. Sobel edge detection looks for rapid change among nearby pixel values. Blurring is the opposite. It seeks a representative average value. Really, edge detection is computing the derivative (representing change in values). Blurring computes the integral (representing a sum divided by the number counted - the average). There are many different variants on this basic idea. But the principle is the same.
All techniques (that I am aware of) search for a pattern (this is where convolution comes in) at all possible positions in an image (translation invariance) at all possible sizes (scale invariance). Some approaches can also compensate for differences in orientation (generalized Hough transform), lighting, and perspective. Translation and scale invariance are the two necessary qualities for an object recognition algorithm. Objects can appear anywhere in an image at different distances from the camera. So the technique must be able to find objects at any position and scale.
Translation invariance is an inherent part of convolution. But scale invariance is not. Normally, convolution only tries to match a fixed size pattern to a single image. This leads to the two approaches to scale invariance (again, that I am aware of).
Object recognition is generally viewed as a "hard" problem. Many convolutions at different scales require a very large amount of calculation. The same object may have a significantly different appearance between images of it due to lighting, occlusion or distortion, orientation and position. Yet we wish the computer vision system to detect the same object despite how different it can look between images. In practice, current technology is still extremely limited compared with human performance. My personal assessment is that object recognition technology is about where speech recognition was in the 1970s. (I was a child then so am not writing from practical experience, only my opinion.) Current systems act as classifiers for a small number of reference objects they know about after extensive (and expensive, in terms of time and effort) training. Object recognition done by computer vision systems today is not very general or robust.
Still, one increasingly common example of object detection using computer vision is face detection systems in digital cameras. Faces stand out by the chromaticity of skin, their round shape, centrally framed position in the image (usually that's how we take photographs of other people), and the patterns of light and dark across the eyes, forehead and cheeks. This functions well enough to be useful in mass market commercial products. So like speech recognition, this technology is not entirely exotic. There are practical applications.
Statistics underlies a great deal of computer vision. It is typically impossible to isolate the desired information in an image cleanly. The result of image transforms and techniques is somewhat messy. Probability must be used to determine the most likely interpretation. So if we are looking for an object in an image, the result might be a noisy plot of possible positions. Which one is the right one? Another problem is identifying an image. What is it? Does it look most like a house, a dog, a tree or a person? We need some approach to pick the most probable answer.
This is where statistics is used. Fundamentally, this becomes tallying counts of image features in a collection of bins (the histogram). Each bin represents some aspect of the image. After counting, the bin with the largest number of counts is chosen as the most probable one. Statistical methods become a deep subject that I do not really know very much about at this point (everyone knows what an average and a variance are but beyond that...). So I won't write anything more about it here.
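The simplest version of "choose the bin with the most counts" is the mode of a grayscale histogram. A sketch with a hypothetical helper (not the library API):

```c
#include <stdint.h>
#include <stddef.h>

/* Build a 256-bin histogram of a grayscale image and return the
 * modal bin, i.e. the most frequent pixel value. */
static size_t histogram_mode(const uint8_t *img, size_t count,
                             uint32_t hist[256])
{
    for (size_t i = 0; i < 256; ++i)
        hist[i] = 0;
    for (size_t i = 0; i < count; ++i)
        ++hist[img[i]];          /* tally each pixel into its bin */

    size_t best = 0;
    for (size_t i = 1; i < 256; ++i)
        if (hist[i] > hist[best])
            best = i;
    return best;
}
```

The same pattern (accumulate votes, pick the peak) reappears in the Hough transform below.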
The Hough transform was invented to find the straight line particle tracks in bubble chamber photographs. Given a collection of points, how do we find the lines that intersect the most number of them? These are the likely paths of the particles through the chamber. More generally, detecting higher level features like lines in a noisy image is useful. The Hough transform can also be applied to circles or just arbitrary shapes. The principle is the same.
The Hough transform works by brute force. Given two distinct points, there can be only one line. Given three points, there can be three possible lines at most. With four points, there might be as many as six lines. But there may be fewer. If so, that means that one of the possible lines contains more than two points. How do we find this line? The Hough transform tries all of them! So for every possible line, count how many points it passes through. The lines that pass through the most points are most likely the "real" lines. I know this sounds completely obvious. Just keep this in mind when reading more mathematical descriptions of the technique that may cause confusion.
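A toy version of the voting scheme might look like this (remember that EmbedCV does not implement the Hough transform yet; the four-angle fixed-point trig table and the rho bound are arbitrary choices for the demo, not anyone's real parameterization). Each edge point votes for every (angle, rho) line through it; peaks in the accumulator are the likely lines.

```c
#include <stdint.h>
#include <stddef.h>

#define N_ANGLES 4      /* 0, 45, 90, 135 degrees only, for brevity */
#define MAX_RHO  128    /* assumed bound on |rho| for this demo */

/* cos and sin scaled by 1024 (fixed point, no floats) */
static const long COS_Q10[N_ANGLES] = { 1024,  724,    0, -724 };
static const long SIN_Q10[N_ANGLES] = {    0,  724, 1024,  724 };

/* Every point (xs[i], ys[i]) votes for each line rho = x*cos + y*sin
 * it could lie on.  acc[a][rho + MAX_RHO] counts the votes. */
static void hough_vote(const size_t *xs, const size_t *ys, size_t n,
                       uint32_t acc[N_ANGLES][2 * MAX_RHO])
{
    for (size_t a = 0; a < N_ANGLES; ++a)
        for (size_t r = 0; r < 2 * MAX_RHO; ++r)
            acc[a][r] = 0;

    for (size_t i = 0; i < n; ++i) {
        for (size_t a = 0; a < N_ANGLES; ++a) {
            long rho = ((long)xs[i] * COS_Q10[a]
                      + (long)ys[i] * SIN_Q10[a]) / 1024;
            acc[a][rho + MAX_RHO]++;
        }
    }
}
```

A real implementation would use many more angles and scan the accumulator for peaks; this only shows the "count how many points each line passes through" idea in code.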
The fundamental morphological operations are erosion and dilation.
Conceptually, these operations can give some "surface tension" to regions in an image. If a region has small holes in it (e.g. image segmentation often results in a region that is peppered with small holes), then dilation can close the holes. Of course, the region has also grown larger as pixels are added around the outer edges. But if the dilation is followed by erosion, those extra pixels are removed. The result is approximately the original region, except that the small holes have been filled. This is called "closing" when dilation is followed by erosion.
The opposite order of erosion followed by dilation tends to remove small tendrils poking out from the region (they are so thin, they evaporate away when eroded). The first pass of erosion trims the small parts away. Then the dilation restores the approximate original edge of the region. Very small regions may disappear entirely when eroded. This is called "opening" when erosion is followed by dilation.
Another use of morphology is edge detection. If a region is eroded or dilated into a slightly different size, then the difference between the original region and the new one is approximately the region edge.
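The bit-shifting approach to morphology mentioned earlier can be sketched for a binary image packed 32 pixels per word (illustrative, and assuming a 32-pixel-wide image with one word per row; EmbedCV's own code differs in detail). Dilation ORs together the 3x3 neighborhood; erosion is the dual and ANDs it. The shifts lose the pixels at the frame edges, which is exactly the side effect described above.

```c
#include <stdint.h>
#include <stddef.h>

/* 3x3 binary dilation: a pixel becomes set if any neighbor is set.
 * One uint32_t per row, so the image is 32 pixels wide. */
static void dilate_rows(const uint32_t *src, uint32_t *dst, size_t rows)
{
    for (size_t y = 0; y < rows; ++y) {
        uint32_t above = y ? src[y - 1] : 0;
        uint32_t below = (y + 1 < rows) ? src[y + 1] : 0;
        uint32_t h  = src[y] | (src[y] << 1) | (src[y] >> 1);
        uint32_t ha = above  | (above  << 1) | (above  >> 1);
        uint32_t hb = below  | (below  << 1) | (below  >> 1);
        dst[y] = h | ha | hb;   /* union of the 3x3 neighborhood */
    }
}

/* 3x3 binary erosion: a pixel survives only if its whole neighborhood
 * is set (outside the frame is treated as set here). */
static void erode_rows(const uint32_t *src, uint32_t *dst, size_t rows)
{
    for (size_t y = 0; y < rows; ++y) {
        uint32_t above = y ? src[y - 1] : ~0u;
        uint32_t below = (y + 1 < rows) ? src[y + 1] : ~0u;
        uint32_t h  = src[y] & (src[y] << 1) & (src[y] >> 1);
        uint32_t ha = above  & (above  << 1) & (above  >> 1);
        uint32_t hb = below  & (below  << 1) & (below  >> 1);
        dst[y] = h & ha & hb;
    }
}
```

Closing is dilate_rows then erode_rows; opening is the reverse. Note how eroding a completely solid image still clears the leftmost and rightmost bits: the shifts cannot preserve the frame edge pixels.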