Music OCR

by Zale and Vaasu

We have created a program that takes an image of sheet music and prints out the notes to play. The program makes the following assumptions about the input:


1) The key signature is C

Converting between key signatures is easy, so we could take the key signature as a user input. The tricky part is reading the signature from an image.

2) Treble Clef.

As with the key signature, converting between clefs is easy, so the clef could be a user input too.

3) No Accidentals

Accidentals are beyond the scope of our computer vision algorithm.

4) All notes are within 1 ledger line of the staff

This means that the notes range from middle C (one ledger line below the staff) up to the A one ledger line above it. We could increase this range, but it did not seem necessary for our demonstration, so we stuck with these notes.

5) Every note has the same rhythm.

Determining different rhythms is pretty tricky, as the program would have to decipher many different characters, some of which look very similar and others of which look very different from each other.


Step 1) Find lines using the Hough transform

Using the Hough transform, we can determine where the lines on a piece of sheet music are. The Hough transform has many parameters that can be adjusted, so our program requires the user to set them (the blurring value, the thresholds for weak and strong edges, the minimum count needed to label a line, etc.) until good lines are found for the image.
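To make the voting idea concrete, here is a minimal pure-Python sketch of a Hough accumulator over a list of edge points. It is not our program's code: the function name hough_lines, the one-degree angle step, and the min_count argument (standing in for the minimum count needed to label a line) are assumptions chosen for illustration, and blurring and edge detection are left out entirely.

```python
import math
from collections import Counter

def hough_lines(edge_points, min_count, n_theta=180):
    """Vote for lines rho = x*cos(theta) + y*sin(theta) through edge points.

    Returns (theta_degrees, rho, votes) for every accumulator cell
    that collected at least min_count votes.
    """
    votes = Counter()
    for x, y in edge_points:
        # Each edge point votes once per candidate angle
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(t, rho)] += 1
    return [(t * 180 / n_theta, rho, n)
            for (t, rho), n in votes.items() if n >= min_count]

# A horizontal run of 50 edge pixels at y = 10, like a piece of a staff line
points = [(x, 10) for x in range(50)]
print(hough_lines(points, min_count=50))
```

Raising min_count makes the detector stricter, which is the kind of trade-off the user tunes from the command line.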

Step 2) Filter lines by length and angle to determine staff lines

We know that the staff lines will be the longest lines on a page of sheet music. Therefore we discard all lines whose length is much smaller than that of the longest line. Likewise, we compare the slope of the longest line to the slope of every other line and discard all lines whose slope is significantly different. We also rotate the image so that the staff lines are horizontal.
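The filtering in this step can be sketched roughly as follows. The segment format (x1, y1, x2, y2), the function name, and both thresholds (half the longest length, two degrees of angle) are hypothetical values picked for the example, not the ones our program uses.

```python
import math

def filter_staff_candidates(segments, min_length_ratio=0.5, max_angle_diff_deg=2.0):
    """Keep segments nearly as long as, and nearly parallel to, the longest one."""
    def length(s):
        x1, y1, x2, y2 = s
        return math.hypot(x2 - x1, y2 - y1)

    def angle_deg(s):
        x1, y1, x2, y2 = s
        # Fold into [0, 180) so the direction of travel does not matter
        return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0

    longest = max(segments, key=length)
    ref_len, ref_angle = length(longest), angle_deg(longest)
    kept = []
    for s in segments:
        diff = abs(angle_deg(s) - ref_angle)
        diff = min(diff, 180.0 - diff)  # 179 and 1 degrees are only 2 apart
        if length(s) >= min_length_ratio * ref_len and diff <= max_angle_diff_deg:
            kept.append(s)
    return kept, ref_angle

segments = [
    (0, 100, 800, 100),   # long horizontal line (staff line)
    (0, 110, 790, 112),   # long, very slightly tilted (staff line)
    (50, 50, 50, 200),    # vertical line (for example a note stem)
    (10, 120, 60, 120),   # short horizontal line
]
kept, ref_angle = filter_staff_candidates(segments)
print(kept, ref_angle)
```

The returned ref_angle is the angle of the longest line, which is the amount the image would need to be rotated by to make the staff lines horizontal.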

Step 3) Find a set of 5 lines that comprise a staff by looking at the gap between the lines

We are looking for five lines whose gaps are all the same, within a certain tolerance. We search through our list of lines in order of y position. If the gap between two adjacent parallel lines matches the gap between the previous two lines, we look at the third gap, and if that also matches we look at the fourth gap. If all four gaps match, we mark a staff.

However, if the second gap is not the same as the first, or the third is not the same as the second, or the fourth is not the same as the third, we start looking for a new staff, with the most recent gap as its first gap.
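A small sketch of this gap-matching search is shown below. It works on a plain list of line y positions. The function name find_staffs and the tolerance of 2 pixels are assumptions for the example.

```python
def find_staffs(ys, tolerance=2):
    """Group line y-positions into staffs of five evenly spaced lines."""
    staffs = []
    run = []  # lines collected so far for the staff being built
    for y in sorted(ys):
        if len(run) >= 2:
            prev_gap = run[-1] - run[-2]
            if abs((y - run[-1]) - prev_gap) > tolerance:
                # Gap mismatch: restart, so the most recent gap becomes the first gap
                run = [run[-1]]
        run.append(y)
        if len(run) == 5:
            staffs.append(run)
            run = []  # begin a fresh search after a complete staff
    return staffs

# One stray line at y = 60, then two staffs (the second slightly uneven)
ys = [60, 100, 110, 120, 130, 140, 200, 210, 221, 230, 240]
print(find_staffs(ys))
```

Each gap is compared with the gap just before it, so small drift across a staff is tolerated while a large jump (such as the space between the stray line and the first staff) restarts the search.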

Step 4) Scan each staff using a window to find notes and their corresponding position in the staff

We create a scan window that travels left to right along each staff (recall that we have rotated the image so that the staffs are horizontal if they were not originally). The width and height of the window are both equal to gap, the spacing between adjacent staff lines. At each x position we move the window down the columns of pixels in the staff and calculate the average darkness inside it. If the average darkness is below the threshold avgthresh (a user input), we declare that there is a note there. When we find a note, we add gap to the x position and continue scanning the staff from there. This skip keeps us from counting the same note twice, and it is small enough that we don't jump too far and miss the next note.

Also note that we find the positions of notes above and below the staff by using the gap to extend the staff outward past its top and bottom lines.
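The scan can be sketched as below on a tiny synthetic image. Several details here are assumptions made for the example rather than facts about our program: the image is a list of rows of grayscale values with 0 as black and 255 as white, "average darkness" is taken to be the mean pixel value in the window, positions are numbered with 0 as the bottom staff line and counted upward in half-gap steps, and the window moves one pixel at a time when no note is found.

```python
def scan_staff(image, staff_ys, avgthresh):
    """Slide a gap-by-gap window along a staff; return (x, position) per note.

    staff_ys holds the y of the five staff lines, top to bottom.
    """
    gap = (staff_ys[4] - staff_ys[0]) // 4
    half = gap // 2
    height, width = len(image), len(image[0])
    notes = []
    x = 0
    while x + gap <= width:
        found = False
        for position in range(10, -3, -1):       # scan from the top downward
            top = staff_ys[4] - position * half - half
            if top < 0 or top + gap > height:
                continue
            total = sum(image[r][c]
                        for r in range(top, top + gap)
                        for c in range(x, x + gap))
            if total / (gap * gap) < avgthresh:
                notes.append((x, position))
                found = True
                break
        x += gap if found else 1                 # skip ahead past a found note
    return notes

# Synthetic staff: white page, five black lines, two solid note heads
width, height = 100, 80
staff_ys = [20, 30, 40, 50, 60]
image = [[255] * width for _ in range(height)]
for y in staff_ys:
    image[y] = [0] * width

def draw_blob(top, left):
    for r in range(top, top + 9):
        for c in range(left, left + 10):
            image[r][c] = 0

draw_blob(41, 30)   # in the space above the second line from the bottom
draw_blob(56, 60)   # centered on the bottom line
print([position for _, position in scan_staff(image, staff_ys, avgthresh=64)])
```

With these assumptions the two note heads come out at positions 3 and 0. The staff lines alone never trigger a detection, because a single line darkens only a tenth of the window.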

Step 5) Label the notes and print out their pitch.

When we find notes, we label each one with a number from -2 to 10 and print out the corresponding pitch, assuming C major, treble clef, and no accidentals. Because the labels are just staff positions, it would be simple to map these numbers to other key signatures or clefs if we knew what those were.
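As an illustration of how small this mapping is, here is a sketch that turns a position number into a pitch name. It assumes that 0 is the bottom line of the staff (so -2 is middle C and 10 is the A above the staff). The function name, the octave numbers in the output, and the optional key argument with its small table of key signatures are all hypothetical additions, since our program only handles C major.

```python
NOTE_NAMES = "CDEFGAB"

# Accidentals implied by a few key signatures (hypothetical user input)
KEY_SIGNATURES = {
    "C": {},
    "G": {"F": "#"},
    "D": {"F": "#", "C": "#"},
    "F": {"B": "b"},
}

def position_to_pitch(position, key="C"):
    """Map a treble-clef staff position (-2..10, 0 = bottom line) to a pitch."""
    if not -2 <= position <= 10:
        raise ValueError("position is outside the one-ledger-line range")
    steps = position + 2                  # diatonic steps above middle C (C4)
    letter = NOTE_NAMES[steps % 7]
    octave = 4 + steps // 7
    accidental = KEY_SIGNATURES[key].get(letter, "")
    return f"{letter}{accidental}{octave}"

print([position_to_pitch(p) for p in (-2, 0, 3, 10)])
print(position_to_pitch(1, key="G"))
```

Supporting another clef would only change which pitch the lowest position starts from.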

User Interface:

Since the program uses the Hough transform, users pass its parameters (described in Step 1) on the command line, followed by avgthresh and the input and output image files:

./hough sigma lo hi nBlur mincnt avgthresh in.png out.png

avgthresh is specific to our program. It sets the average darkness of the scan window at which we consider something a note.

length startx starty endx endy slope staff# staffline

References: Professor Scharstein and "Computer Vision" by Shapiro and Stockman