First really useful test:
It's 21 tests, and there's an error somewhere in the 13th test: it only adds up to 96. A very simple averaging gives:
None ( ): 30.00%
Well-crafted (-): 30.57%
Finely-crafted (+): 17.43%
Superior Quality (*): 19.29%
Exceptional (≡): 2.48%
Masterwork (☼): 0.05%
The interesting thing that leaps out is that Superior (*) is actually more likely than Finely-crafted (+) in this test sample. I can see three possible explanations. It could be an error in the test, but that seems unlikely: a single error can't be skewing things, since just under 62% of the sub-trials had more Superior (*) than Finely-crafted (+). It could be the result of too few trials, which is possible, although things otherwise seemed to be converging reasonably well. Or it could be telling us something we didn't know about the way quality is determined, possibly including a bug in DF.
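As a rough check on the "too few trials" possibility, here is a minimal sketch of an exact sign test. It assumes that "just under 62%" means 13 of the 21 sub-trials (13/21 = 61.9%), that there were no ties, and that the sub-trials are independent; the function name is mine, not anything from the test write-up.

```python
from math import comb


def sign_test_p_value(wins: int, trials: int) -> float:
    """Two-sided exact sign test against a 50/50 null, ignoring ties.

    wins   -- sub-trials where Superior (*) outnumbered Finely-crafted (+)
    trials -- total number of sub-trials compared
    """
    # Use the larger side so the tail is always the "at least this extreme" one.
    k = max(wins, trials - wins)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    # Double the one-sided tail for a two-sided test, capped at 1.
    return min(1.0, 2 * upper_tail)


# Assumption: 13 of 21 is the count implied by "just under 62%".
p = sign_test_p_value(13, 21)
print(f"two-sided p = {p:.3f}")  # two-sided p = 0.383
```

On those assumptions, a 13-to-8 split would turn up fairly often even if the two qualities were equally likely, so the sub-trial count by itself doesn't rule out chance. It only looks at which quality won each sub-trial, not by how much, so it is a weak test either way.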
It would make sense for the odds to have been originally set to something like 30% / 30% / 20% / 20% minus a little / small / minuscule. If something along those lines was done, though, it looks like the chance for the high-end results was pulled out of Finely-crafted (+) rather than out of Superior (*), as a typical RPG progression would have done.
The fact that there was a single Masterwork (☼) crafted is also interesting, although the odds are probably no better than 1/1000.
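To put the single Masterwork in context, here is a small sketch of how likely it is to see at least one in a sample this size for a few candidate rates. The total of 2096 items is my assumption (20 tests of 100 plus the one that adds up to 96), and the candidate rates are just illustrative guesses, not known game values.

```python
def chance_of_at_least_one(rate: float, items: int) -> float:
    """Probability of one or more successes in `items` independent rolls."""
    return 1.0 - (1.0 - rate) ** items


# Assumption: 20 full tests of 100 items plus the 13th test's 96.
TOTAL_ITEMS = 20 * 100 + 96

# Hypothetical per-item Masterwork rates to compare against the sample.
for denominator in (500, 1000, 2000, 5000):
    chance = chance_of_at_least_one(1 / denominator, TOTAL_ITEMS)
    print(f"rate 1/{denominator}: P(at least one Masterwork) = {chance:.2f}")
```

One hit in a sample this small is compatible with a wide range of true rates, so it can't pin the odds down much beyond "rare".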
It's not clear to me from the test description whether the dwarf could gain skill during the 100-bin runs. If skill gain happened, then the discontinuity is probably the result of the output likelihoods changing with increased skill. What we probably need, if this isn't already that, is to collect statistics on runs that stop just short of the amount needed to level, manually produce one or a few more uncounted items to kick the dwarf to the next skill level, and then start counting again as a separate group until just short of the next level, and so on. (This is intended to sidestep some other unanswered questions buried in here. For instance, when producing an item that results in a skill-up, is the quality rolled at the lower or the higher skill's chances? We'd need results uncontaminated by this first, and then a very large number of skill-up items only, recorded at a level where the chances on either side are distinguishable; not really practical.)
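The bookkeeping for that kind of per-level collection could look something like the sketch below. The record layout (skill level, quality, and a flag for items made while crossing a level boundary) and all the names are hypothetical, just one way to keep the groups separate and leave the ambiguous skill-up items out of the counts.

```python
from collections import Counter, defaultdict

# Quality labels in ascending order, matching the list earlier in the post.
QUALITIES = ["none", "well-crafted", "finely-crafted",
             "superior", "exceptional", "masterwork"]


def tally_by_skill(records):
    """Group quality counts by skill level.

    records -- iterable of (skill_level, quality, caused_level_up) tuples
    """
    tallies = defaultdict(Counter)
    for skill_level, quality, caused_level_up in records:
        if caused_level_up:
            # Skip these: we don't know which skill level the roll used.
            continue
        tallies[skill_level][quality] += 1
    return tallies


def percentages(counter):
    """Convert one skill level's counts into percentages per quality."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {q: 100.0 * counter[q] / total for q in QUALITIES}
```

Keeping the skill-up items out entirely means each group reflects a single skill level, at the cost of throwing away a few results per level.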