Here is an approach - it is probably not perfect, but you should be able to tweak it if you have more images available to test with.
The basic idea is to fatten up the individual letters so they touch each other but hopefully without bridging across to adjacent words. Then do a "Connected Components Analysis" to find the individual words of your original text as blobs.
Here is the first step - fattening the letters with ImageMagick:
convert text.png -threshold 50% -morphology erode diamond:4 step1.png
I am using morphology techniques above, but you could equally try blurring and thresholding techniques instead.
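For example, a rough equivalent using a blur followed by a second threshold might look like the following - this is only a sketch, and the blur sigma and the 90% level are guesses you would need to tune to your font size and spacing:

convert text.png -threshold 50% -blur 0x2 -threshold 90% step1.png

The blur smears a grey halo around each black letter, and the second threshold then turns that halo solid black, fattening the strokes much as the erosion does.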
Now find the "blobs":
convert step1.png \
   -define connected-components:verbose=true \
   -define connected-components:area-threshold=100 \
   -connected-components 8 -auto-level output.png
Sample Output
Objects (id: bounding-box centroid area mean-color):
0: 1086x188+0+0 556.4,83.0 155156 gray(255)
7: 364x65+128+118 281.8,142.6 9206 gray(0)
6: 212x34+817+115 919.3,131.1 4691 gray(0)
4: 231x33+73+76 184.4,92.3 4645 gray(0)
2: 181x42+494+8 578.4,27.7 4399 gray(0)
9: 209x31+608+118 713.1,132.9 3892 gray(0)
17: 148x34+826+148 903.0,165.1 2932 gray(0)
22: 132x34+20+153 84.1,169.1 2453 gray(0)
20: 126x27+384+151 443.4,165.9 2404 gray(0)
1: 91x42+396+8 440.3,29.0 2390 gray(0)
18: 117x34+708+149 764.2,165.5 2350 gray(0)
21: 104x33+509+151 560.3,167.6 2245 gray(0)
23: 112x27+271+158 325.8,169.3 2159 gray(0)
8: 100x33+507+118 558.2,134.6 1982 gray(0)
19: 91x33+615+150 659.4,166.2 1888 gray(0)
10: 55x25+73+121 100.0,134.4 920 gray(0)
3: 28x29+361+12 373.8,27.5 456 gray(0)
Each line above corresponds to one blob, or hopefully one word of your original text. The first line is a header which tells you what the fields are. On each subsequent line, the second field is the blob's bounding box, given as WIDTHxHEIGHT followed by its +X+Y offset from the top-left corner of the image - so, for example, 364x65+128+118 means a blob 364 pixels wide and 65 pixels high whose top-left corner is 128 pixels in from the left and 118 pixels down from the top. The remaining fields are the centroid, the area in pixels and the mean colour.
You don't need this next step, as I guess the lines describing each word are what you are actually after, but by way of illustration, I can draw the boxes onto the original image:
convert "text.png" -stroke red -fill none -strokewidth 1 \
-draw "rectangle 128,118 492,183" \
-draw "rectangle 817,115 1029,149" \
-draw "rectangle 73,76 304,109" \
-draw "rectangle 494,8 675,50" \
-draw "rectangle 608,118 817,149" \
-draw "rectangle 826,148 974,182" \
-draw "rectangle 20,153 152,187" \
-draw "rectangle 384,151 510,178" \
-draw "rectangle 396,8 487,50" \
-draw "rectangle 708,149 825,183" \
-draw "rectangle 509,151 613,184" \
-draw "rectangle 271,158 383,185" \
-draw "rectangle 507,118 607,151" \
-draw "rectangle 615,150 706,183" \
-draw "rectangle 73,121 128,146" \
-draw "rectangle 361,12 389,41" result.png
You don't actually need to create the intermediate images as I did above - you can do it all in one go as follows - but I wanted to explain my technique step by step:
convert text.png -threshold 50% -morphology erode diamond:4 \
   -define connected-components:verbose=true \
   -define connected-components:area-threshold=100 \
   -connected-components 8 -auto-level output.png
The output image (output.png) is actually "labelled", by which I mean that the pixels of each identified blob are coloured in a successively lighter shade of grey.
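If all you want at this stage is a quick sanity check on how many words were detected, you could throw the image away and simply count the black blobs in the verbose listing - a small sketch, again assuming your words come out as gray(0):

convert text.png -threshold 50% -morphology erode diamond:4 \
   -define connected-components:verbose=true \
   -define connected-components:area-threshold=100 \
   -connected-components 8 null: | grep -c "gray(0)"

That prints 16 for the sample output above - one line per word, with the gray(255) background excluded.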
Note that there are other structuring element shapes and sizes that may give better results with your images, e.g.:
convert text.png -threshold 50% -morphology erode disk:3 result.png
See Anthony Thyssen's excellent introduction to morphology in the ImageMagick Usage documentation.
Regarding your further question of splitting the image into its constituent parts, each containing a word, you could do the following...
Pipe the output of the previous convert command into awk to find all the lines that contain the word gray, and print out the geometry of the box in field 2.
convert ... as above ... | awk '/gray/{print $2}'
Sample Output
1086x188+0+0
364x65+128+118
212x34+817+115
231x33+73+76
181x42+494+8
209x31+608+118
148x34+826+148
132x34+20+153
126x27+384+151
91x42+396+8
117x34+708+149
104x33+509+151
112x27+271+158
100x33+507+118
91x33+615+150
55x25+73+121
28x29+361+12
Now split that on the plus sign to separate the X and Y:
convert ... | awk '/gray/{split($2,a,"+");print a[1],a[2],a[3]}'
Sample Output
1086x188 0 0
364x65 128 118
212x34 817 115
231x33 73 76
181x42 494 8
209x31 608 118
148x34 826 148
132x34 20 153
126x27 384 151
91x42 396 8
117x34 708 149
104x33 509 151
112x27 271 158
100x33 507 118
91x33 615 150
55x25 73 121
28x29 361 12
Now sort by Y then X so that words come out in line order (Y counts down from the top) then word order (X counts across from the left):
convert ... | awk '/gray/{split($2,a,"+");print a[1],a[2],a[3]}' |
sort -n -k3 -k2
Sample Output
1086x188 0 0
91x42 396 8
181x42 494 8
28x29 361 12
231x33 73 76
212x34 817 115
364x65 128 118
100x33 507 118
209x31 608 118
55x25 73 121
148x34 826 148
117x34 708 149
91x33 615 150
126x27 384 151
104x33 509 151
132x34 20 153
112x27 271 158
Now, pass the geometries into a loop and read them into bash variables, then crop the original image and name the individual words with a simple index (i).
convert ... | awk '/gray/{ split($2,a,"+");print a[1],a[2],a[3] }' |
   sort -n -k3 -k2 |
   { i=0;
     while read g x y; do
        convert text.png -crop ${g}+${x}+${y} word-${i}.png
        ((i+=1))
     done; }
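One thing you may notice is that the first geometry in the sorted list (1086x188 0 0) is the white background blob covering the whole image, so word-0.png will just be a copy of the entire picture. A possible refinement - only a sketch - is to match gray(0) rather than gray so that only the black word blobs survive, and to add +repage so each crop forgets its offset on the original canvas:

convert ... as above ... | awk '/gray\(0\)/{ split($2,a,"+");print a[1],a[2],a[3] }' |
   sort -n -k3 -k2 |
   { i=0;
     while read g x y; do
        convert text.png -crop ${g}+${x}+${y} +repage word-${i}.png
        ((i+=1))
     done; }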
Note that if your text is even slightly rotated, say 1 degree anticlockwise, the words at the right-hand end of a given line will have a smaller Y coordinate than those at the left-hand end, and so may come out before them in the image order when sorted. If that happens, you may need to round the Y coordinate to the nearest 10, or 20, or 40 in the awk, so that if one word is 168 pixels from the top and another is 169 pixels from the top, they both get sorted as though they were, say, 170 pixels from the top and come out on the same line.
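Something along these lines might do that - a rough sketch that adds a fourth field containing Y rounded to the nearest 20 pixels (the 20 is just a guess; pick something around half your line height) and sorts on that, while still cropping with the original coordinates:

convert ... as above ... |
   awk '/gray\(0\)/{ split($2,a,"+"); print a[1], a[2], a[3], int((a[3]+10)/20)*20 }' |
   sort -n -k4 -k2 |
   { i=0;
     while read g x y rounded; do
        convert text.png -crop ${g}+${x}+${y} +repage word-${i}.png
        ((i+=1))
     done; }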