Monday, April 5, 2010

Accuracy versus Precision

Although the two words can be synonymous in colloquial use, they are deliberately contrasted in the context of the scientific method.

Imagine my dismay when I listened to Ancestry.com explain their criteria for accepting data to post and they used the word ACCURACY. I think this is the core of my problem with Ancestry.com. It is probably a side effect of having worked with engineering configurations for over 30 years.

My suggested solution for genealogists as a whole is to adopt a modified ISO 8000, which is the international standard for data quality.

There are several well-known authors and self-styled experts, with Larry English perhaps the most popular. In addition, the International Association for Information and Data Quality (IAIDQ) was established in 2004 to provide a focal point for professionals and researchers in this field.

The standard governs data collection with several considerations including but not limited to:

1. Rules management
2. Metadata Verification and validation
3. Database sanitation including redundancy
4. Data Profiling

Rules management: Many genealogy software programs have built-in rule engines. Rule engine software is commonly provided as a component of a business rule management system which, among other functions, provides the ability to register, define, classify, and manage all the rules, and to verify the consistency of rule definitions. Examples: Mothers must be at least 13 years of age. People do not have children over the age of 80. Parents do not marry their children.
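The three example rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular genealogy program implements its rule engine; the Person class, its field names, and the check_person function are all hypothetical, and ages are estimated from birth years alone.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MIN_MOTHER_AGE = 13  # "Mothers must be at least 13 years of age."
MAX_PARENT_AGE = 80  # "People do not have children over the age of 80."


@dataclass(eq=False)  # compare people by identity, not by field values
class Person:
    # Hypothetical record layout, invented for this sketch.
    name: str
    birth_year: int
    mother: Optional["Person"] = None
    father: Optional["Person"] = None
    spouses: List["Person"] = field(default_factory=list)


def check_person(person: Person) -> List[str]:
    """Return a list of rule violations for one person (empty if none)."""
    problems = []
    for role, parent in (("mother", person.mother), ("father", person.father)):
        if parent is None:
            continue
        age_at_birth = person.birth_year - parent.birth_year
        if role == "mother" and age_at_birth < MIN_MOTHER_AGE:
            problems.append(
                f"{parent.name} was about {age_at_birth} when {person.name} was born"
            )
        if age_at_birth > MAX_PARENT_AGE:
            problems.append(
                f"{parent.name} was about {age_at_birth} when {person.name} was born"
            )
        # "Parents do not marry their children."
        if parent in person.spouses:
            problems.append(f"{person.name} is listed as married to parent {parent.name}")
    return problems


# Made-up example: a mother recorded as 10 years older than her child.
mary = Person("Mary", 1800)
jane = Person("Jane", 1810, mother=mary)
for problem in check_person(jane):
    print(problem)
```

Each message is a "possible problem" to investigate against the sources, not proof that the entry is wrong.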


Metadata Verification and validation: Assess whether metadata accurately describes the actual values in the source database.

Database sanitation including redundancy: Understand data challenges early in any data-intensive project, so that surprises late in a tree project are avoided. Finding data problems late in your research has caused more than one "professional level" genealogist to dump an entire lineage. (I purposely omitted the names here to protect the guilty, but you know who you are!!)

Therefore I call this data sanitation, a phrase I made up when I was asked to analyze large databases with data not monitored and controlled by myself. My boss would ask when the report would be ready; depending on the missing data, incorrect data, and typos, the answer would vary. Some genealogy programs help with this process by running "possible problem reports" for you. But each "suggestion" must be examined carefully and evaluated as to how any changes would affect dependent joined data.

Rather than using any family tree on Ancestry.com or independent web sites, I have found it less time consuming to start from scratch and enter data independently. I often find that by the 50th entry I run into a "not possible, probably not correct" data point.

Data Profiling: Several genealogy software programs help with this by setting up quick, easy-to-read charts and reports, giving us, for example:
Frequency counts: most of my ancestors should be in the United States, so the country code with the largest number of occurrences should be USA. I have no one born in Japan in my family database history, so the count for Japan should be zero.

Although this seems bogus, I have seen names of cities mistakenly entered as names of countries.
Statistics:
Minimum value: for example, youngest person at death
Maximum value: for example, oldest person at death
Mean value (average): for example, average age at marriage
Median value: All Civil War veterans lived within 1860-1870 or longer
Standard deviation
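
For readers who keep their data in a spreadsheet or export file, the frequency counts and statistics above can be sketched with nothing more than Python's standard library. The records, field names, and the profile function below are made up for illustration; they are not taken from any real tree or any specific genealogy program.

```python
from collections import Counter
import statistics

# Made-up sample records; "Chicago" is a city wrongly entered as a country.
records = [
    {"name": "Person A", "birth_country": "USA", "age_at_death": 71},
    {"name": "Person B", "birth_country": "USA", "age_at_death": 64},
    {"name": "Person C", "birth_country": "USA", "age_at_death": 80},
    {"name": "Person D", "birth_country": "Chicago", "age_at_death": 2},
    {"name": "Person E", "birth_country": "USA", "age_at_death": 58},
]

EXPECTED_COUNTRIES = {"USA"}  # assumption: where this family should appear


def profile(rows):
    """Summarize birth countries and ages at death for a list of records."""
    counts = Counter(row["birth_country"] for row in rows)
    ages = [row["age_at_death"] for row in rows if row["age_at_death"] is not None]
    return {
        "country_counts": counts,
        "unexpected_countries": sorted(c for c in counts if c not in EXPECTED_COUNTRIES),
        "min_age": min(ages),
        "max_age": max(ages),
        "mean_age": statistics.mean(ages),
        "median_age": statistics.median(ages),
        "stdev_age": statistics.stdev(ages),  # sample standard deviation
    }


summary = profile(records)
print(summary["country_counts"].most_common())
print(summary["unexpected_countries"])
```

Any value that lands in the unexpected list, or a minimum or maximum that looks impossible, is a prompt to go back to the source record.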

As in many scientific disciplines, popular opinion does not make it FACT. When I realize "the experts" are basing their findings on popular vote (i.e., one of the worst offenders, ONE WORLD CONNECT), I am reminded of the story of The Emperor's New Clothes.

It takes every muscle in my body to clamp my mouth shut. My Dearman and Chapyn lineages agree with no other lineage I have seen posted yet. And furthermore, I am okay with that.

In the fields of engineering, industry and statistics, the accuracy of a measurement system is the degree of closeness of measurements of a quantity to its actual (true) value. The precision of a measurement system, also called reproducibility or repeatability, is the degree to which repeated measurements under unchanged conditions show the same results. Ergo, if all family historians researched their own data, copied from family Bibles, land deeds, census records, and birth and death records, we could use that data to do just as Ancestry.com suggests: calculate precision based on reproducibility and repeatability. BUT if 20 family historians go in and copy unvalidated, unverified family trees and then use loose rules of design of experiments, the copies agree with one another without being any closer to the truth, and we have a very unstable database that is almost to the point of being useless.
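
A small numeric sketch may make the distinction concrete. It assumes, purely for illustration, that an ancestor's true birth year is known to be 1702; in real research the true value is usually what we are trying to establish. The function name and both sets of birth years are invented. Accuracy is approximated here by the bias (mean minus true value) and precision by the spread (standard deviation).

```python
import statistics


def accuracy_and_precision(values, true_value):
    """Return (bias, spread): bias near 0 is accurate, small spread is precise."""
    bias = statistics.mean(values) - true_value  # closeness to the true value
    spread = statistics.pstdev(values)           # agreement among the entries
    return bias, spread


TRUE_BIRTH_YEAR = 1702  # assumed known, for illustration only

# Twenty trees that all copied the same unverified entry.
copied_trees = [1712] * 20

# Five researchers who each worked from primary records (made-up values).
independent_research = [1701, 1702, 1702, 1703, 1702]

print(accuracy_and_precision(copied_trees, TRUE_BIRTH_YEAR))
print(accuracy_and_precision(independent_research, TRUE_BIRTH_YEAR))
```

The copied trees show perfect agreement, yet every one of them is ten years off. Agreement among copies measures only how faithfully people copy.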

Then for the LDS to indicate that this will be the basis for their WORLD TREE... HELP! Definition of stress: the energy used to prevent myself from screaming at these presentations!!!!

1 comment:

  1. The difference between LDS' approach to family trees and Ancestry.com's is that the latter makes no claim as to accuracy.

    On Ancestry.com any user can submit anything at all, including material lifted from LDS's highly error-prone IGI.

    In LDS' approach, the matter of what to do when relationships are disproved after rituals have been performed has not been resolved. Unwillingness to face this may be what is behind removing Temple Records again to a separate database.
