Project - Scraping an image for numbers

21 Sep 2013

As of this week, I’ve completed my first project that was submitted on this site. It was for a friend who has been learning to program to be more efficient at his job. He asked me to help him parse some information out of an image on a web page. At first I thought he meant typical web scraping, but then I found out that he wanted to read the numbers printed on rendered charts.

It was a bit daunting at first, so I told him that I might not have a solution for him. Then I started searching for Python libraries to parse images. Eventually, I found a Stack Overflow post suggesting OpenCV (http://opencv.org/). So I set off on installing CCMake and OpenCV on my Mac. After enough time and effort, my machine was set up.

My solution was to cut out small images of each of the digits, the decimal point and the negative sign on the chart, then download the newly rendered image and use the matchTemplate method to find where those characters appeared. The code below shows the method I wrote for finding all instances of an image of a digit and returning their coordinates.

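import cv2
import numpy as np


def get_all_matches(template, image):
    """
    Fetches all the matching templates in the image.

    Returns a list of (x, y) top-left corners where the template matched.
    """
    # Score the template against every position in the image.
    result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
    threshold = 0.95
    # np.where yields (rows, cols); reversing gives (cols, rows), i.e. (x, y).
    loc = np.where(result >= threshold)
    return_list = [
        (pt[0], pt[1]) for pt in zip(*loc[::-1])
    ]
    return return_list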
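
Running that lookup once per character gives one list of positions for each digit, the decimal point and the negative sign. Here is a minimal sketch of collecting those lists by character. The names `collect_matches`, `templates` and `matcher` are hypothetical and not from my script; `matcher` stands in for get_all_matches so the example runs without OpenCV, and the "image" is just a string.

```python
def collect_matches(templates, image, matcher):
    """Map each character to the list of (x, y) positions where it matched.

    templates: dict of character -> template (hypothetical layout).
    matcher:   callable(template, image) -> list of (x, y) positions,
               for example get_all_matches.
    """
    return {char: matcher(template, image) for char, template in templates.items()}


# Stand-in matcher (an assumption for illustration only): it "finds" a
# character wherever it occurs in a string and reports (column, 0).
def fake_matcher(template, image):
    return [(x, 0) for x, ch in enumerate(image) if ch == template]


templates = {c: c for c in "0123456789.-"}
positions = collect_matches(templates, "500.30", fake_matcher)
```

With real images, `templates` would hold the cropped character images and `matcher` would be get_all_matches.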

Once I had the list of coordinates of all the digits, I created a dictionary with the key being the digit and the value being the list of coordinates of the digit’s appearances in the image. Then I made use of a pattern in the images: the numbers I was attempting to parse always sat between two words. For example, a chart might read Open: 500.30 High: 531.35. So I found the coordinates of the words “Open” and “High” and checked all of the digits’ coordinates for any that fell between those two coordinates. Below is the method that handled that logic.

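from operator import itemgetter


def get_value_between(start_tup, end_tup, numbers):
    """
    Builds the number printed between two words from the matched characters.
    """
    # The word tuples are read with index 1 as x and index 0 as y.
    x_range = (start_tup[1], end_tup[1])
    y_mid = (start_tup[0]+end_tup[0])/2
    # Accept characters slightly above and up to 10 pixels below the words.
    y_range = (y_mid-2, y_mid+10)
    results = []
    # .items() works on both Python 2 and Python 3.
    for key, elements in numbers.items():
        for x, y in elements:
            if x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]:
                # -2 and -1 are sentinels for the decimal point and the sign.
                if key == ".":
                    results.append((x, y, -2))
                elif key == "-":
                    results.append((x, y, -1))
                else:
                    results.append((x, y, int(key)))
    # Sort left to right so the digits are read in printed order.
    sorted_results = sorted(results, key=itemgetter(0))
    value = 0.0
    has_decimal = False
    is_negative = False
    decimal_places = 0
    for x, y, digit in sorted_results:
        if digit == -2:
            has_decimal = True
        elif digit == -1:
            is_negative = True
        else:
            # Count the digits after the decimal point to scale at the end.
            if has_decimal:
                decimal_places += 1
            value *= 10.0
            value += digit
    if is_negative:
        value *= -1
    return value / (10.0 ** decimal_places)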
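
For comparison, here is a sketch of a different way to do the same job: sort the matched characters left to right, join them into a string and let float() do the parsing. This is not the code I shipped. The name `read_value_between`, the symmetric `y_tolerance` and the sample coordinates are all made up for illustration, and every position is assumed to be an (x, y) tuple.

```python
from operator import itemgetter


def read_value_between(start_xy, end_xy, char_positions, y_tolerance=10):
    """Join the characters found between two labels and parse them as a float.

    start_xy, end_xy: (x, y) positions of the two labels (assumed convention).
    char_positions:   dict of character -> list of (x, y) matches.
    """
    x_min, x_max = start_xy[0], end_xy[0]
    y_mid = (start_xy[1] + end_xy[1]) / 2.0
    hits = []
    for char, points in char_positions.items():
        for x, y in points:
            # Keep characters between the labels and on roughly the same row.
            if x_min <= x <= x_max and abs(y - y_mid) <= y_tolerance:
                hits.append((x, char))
    # Left-to-right order reproduces the printed text, e.g. "500.30".
    text = "".join(char for _, char in sorted(hits, key=itemgetter(0)))
    return float(text) if text else None


# Hypothetical matches for "Open: 500.30 High: 531" on one row (y = 21)
# and "-1.5" on a second row (y = 40).
sample = {
    "5": [(60, 21), (170, 21), (76, 40)],
    "0": [(68, 21), (76, 21), (96, 21)],
    ".": [(84, 21), (72, 40)],
    "3": [(88, 21), (178, 21)],
    "1": [(186, 21), (66, 40)],
    "-": [(60, 40)],
}
open_value = read_value_between((10, 20), (120, 20), sample)
change_value = read_value_between((10, 40), (120, 40), sample)
```

The string approach needs no sentinel values for the decimal point and the negative sign, but it relies on float() accepting whatever characters were matched, so a stray match would raise a ValueError instead of quietly producing a wrong number.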

I had a few hacks in there to handle the decimal and negative sign, but I’m pleased with the overall result and so was my friend. I was extremely pleased to hear that he has plans to expand the script to check additional numbers in the image.

As for numbers, I spent about 15 hours on this project. However, the large majority of that was installing CCMake and OpenCV. The actual software development (requirements, coding, testing) was closer to 5 hours. Regardless, this was a fun project and it gave me exactly what I wanted!