2. Collecting, Sifting, Sorting, and Matching
Page 22

The transcribed records showed almost perfect format consistency, a requirement for the five-step computer analysis used:9

1.  Preprocessing: Individual records were distinguished by a date or a page number; section headers did not begin with a date, a simple integer (page number), or a leading blank after a carriage return. Line continuations had been given sixteen leading blanks after a carriage return; in this case the carriage return, together with the fifteen blanks that followed it, was deleted globally, which left a single blank to separate the joined text. Underlines, bold attributes, and blank lines were removed; editorial comments by the transcriber were also erased.

2.  Parsing: Section headers were filtered out and stored in a first table. The prose was broken down into individual records, each beginning with a date and/or a page number. Records were stored in a second table and tagged with the appropriate section-header index and the date. Where a record started with a page number only, the date of the previous record was carried over.

3.  Fractionating: This was the back-breaking, largely manual part of the work. For each record, the monikers were identified, and each new moniker was entered into a third table. The relationship between a specific moniker and a record, together with flags and category attributes, was entered into a fourth table, the linkage table; we call these linkage records ‘events.’ Especially important records were flagged, as was any sign that the person was deceased (e.g., mention of heirs or a widow). The type of activity or event was classified into one of fourteen categories, and a comment field allowed for additional specifications. Fortunately, after an initial learning period, the program recognized repeating monikers in the prose and selected them automatically.

4.  Filtering and Sorting: Events were sorted by date and then filtered by moniker. For a given moniker, we tried to construct the likeliest formal name and checked whether this name already existed. Events were then linked to this formal name in batches (highlight and click). A number of patterns helped in this process: identical monikers within a given time window; same location, same position in the tax list; mention of family links (e.g., brother of, father of, daughter of); estate settlements; etc.

5.  Matching and reassembling: Once some key moniker patterns had established an identifiable person, the remaining events were scanned for partial patterns that fit this person, and these events were tagged with the same identity. Sometimes the emergence of a new pattern required releasing events from a person to whom they had already been assigned, which made the process iterative.
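
The first two steps are mechanical enough to illustrate in code. The following Python sketch is not the author's program; it only shows one way the described rules could be expressed. It assumes ISO-style dates (`YYYY-MM-DD`), that a date, when present, precedes the page number at the start of a record, and that any remaining indented line belongs to the preceding record. None of these details are stated in the text, and the names `preprocess`, `parse`, and `Record` are invented for the example. Removal of underlines, bold attributes, and transcriber comments is left out because their markup is not described.

```python
import re
from dataclasses import dataclass
from typing import Optional

# A continuation is a line break followed by sixteen blanks; deleting the
# break plus fifteen blanks leaves one blank as a word separator.
CONTINUATION = "\n" + " " * 15

# Assumed record start: optional YYYY-MM-DD date, then optional page number.
RECORD_START = re.compile(
    r"^(?:(?P<date>\d{4}-\d{2}-\d{2})\s+)?(?:(?P<page>\d+)\s+)?(?P<text>\S.*)$"
)


@dataclass
class Record:
    header_index: Optional[int]  # index into the header list, if any
    date: Optional[str]          # carried over when only a page is given
    page: Optional[int]
    text: str


def preprocess(raw: str) -> list[str]:
    """Step 1: join continuation lines and drop blank lines."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace(CONTINUATION, "")
    return [line for line in text.split("\n") if line.strip()]


def parse(lines: list[str]) -> tuple[list[str], list[Record]]:
    """Step 2: separate section headers from dated or paged records."""
    headers: list[str] = []
    records: list[Record] = []
    last_date: Optional[str] = None
    for line in lines:
        if line.startswith(" "):
            # Assumption: a stray indented line extends the previous record.
            if records:
                records[-1].text += " " + line.strip()
            continue
        match = RECORD_START.match(line)
        date, page = match.group("date"), match.group("page")
        if date is None and page is None:
            headers.append(line.strip())  # neither date nor page: a header
            continue
        last_date = date or last_date     # carry the previous date over
        header_index = len(headers) - 1 if headers else None
        records.append(
            Record(header_index, last_date,
                   int(page) if page else None, match.group("text"))
        )
    return headers, records


# Invented sample input, for illustration only.
sample = (
    "Tax list\r\n"
    "1612-03-12 Hans pays two gulden\r\n"
    + " " * 16 + "for the mill\r\n"
    "\r\n"
    "45 Widow settles the estate\r\n"
)
headers, records = parse(preprocess(sample))
```

In this sketch the second sample record starts with a page number only, so it inherits the date of the record before it, as described in step 2.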

Steps 3 and 5 introduced yet another large source of subjectivity.

9 Data was stored in a relational database, first in Microsoft SQL Server Express and later in MySQL; the analysis software was written by the author, first in C# and later in Java.
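
As a rough illustration of how the four tables and the assign-and-release cycle of steps 3 to 5 might fit together in a relational database, here is a minimal Python sketch. It uses SQLite from the Python standard library purely as a stand-in for the database systems named above, and every table, column, and function name in it is hypothetical. The `person` table for formal names is an assumption as well, since the text does not say how formal names were stored. Dates are assumed to be ISO-style text so that they sort chronologically.

```python
import sqlite3

# Hypothetical schema: headers, records, monikers, and linkage records
# ("events"), plus an assumed table of formal names.
SCHEMA = """
CREATE TABLE section_header (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE record (
    id INTEGER PRIMARY KEY,
    header_id INTEGER REFERENCES section_header(id),
    record_date TEXT,
    page INTEGER,
    body TEXT NOT NULL
);
CREATE TABLE moniker (id INTEGER PRIMARY KEY, spelling TEXT NOT NULL UNIQUE);
CREATE TABLE person (id INTEGER PRIMARY KEY, formal_name TEXT NOT NULL UNIQUE);
CREATE TABLE event (
    id INTEGER PRIMARY KEY,
    record_id INTEGER NOT NULL REFERENCES record(id),
    moniker_id INTEGER NOT NULL REFERENCES moniker(id),
    person_id INTEGER REFERENCES person(id),  -- NULL until matched
    category INTEGER NOT NULL CHECK (category BETWEEN 1 AND 14),
    important INTEGER NOT NULL DEFAULT 0,
    deceased INTEGER NOT NULL DEFAULT 0,
    comment TEXT
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def moniker_id(db: sqlite3.Connection, spelling: str) -> int:
    """Step 3: enter a moniker only if it is new, and return its id."""
    db.execute("INSERT OR IGNORE INTO moniker (spelling) VALUES (?)", (spelling,))
    row = db.execute("SELECT id FROM moniker WHERE spelling = ?", (spelling,))
    return row.fetchone()[0]


def events_for_moniker(db: sqlite3.Connection, spelling: str) -> list:
    """Step 4: events for one moniker, sorted by record date."""
    rows = db.execute(
        """SELECT e.id, r.record_date, r.body
           FROM event e
           JOIN record r ON r.id = e.record_id
           JOIN moniker m ON m.id = e.moniker_id
           WHERE m.spelling = ?
           ORDER BY r.record_date, e.id""",
        (spelling,),
    )
    return rows.fetchall()


def assign(db: sqlite3.Connection, event_ids: list, formal_name: str) -> int:
    """Steps 4 and 5: link a batch of events to a formal name."""
    db.execute("INSERT OR IGNORE INTO person (formal_name) VALUES (?)", (formal_name,))
    pid = db.execute(
        "SELECT id FROM person WHERE formal_name = ?", (formal_name,)
    ).fetchone()[0]
    db.executemany(
        "UPDATE event SET person_id = ? WHERE id = ?", [(pid, e) for e in event_ids]
    )
    return pid


def release(db: sqlite3.Connection, event_ids: list) -> None:
    """Step 5: detach events from a person so they can be reassigned."""
    db.executemany(
        "UPDATE event SET person_id = NULL WHERE id = ?", [(e,) for e in event_ids]
    )
```

Keeping the person link as a nullable column on the event row makes releasing an event a simple update, which suits the iterative reassignment described in step 5; a separate assignment table would be another reasonable design.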


Source: https://www.stuehlingen.online/Book/?page_id=1076