- Moacir P. de Sá Pereira
This python script breaks up a text into its internal sections. It uses a light markup scheme to signal where chapters and sections begin, and it also can keep track of dialogue by speaker. Given an electronic version of The Great Gatsby, for example, after the markup, it is possible to extract only Tom Buchanan’s lines.
The markup that breaks out the sections and dialogue was created by David Hoover, though the entirety of Prof. Hoover’s markup scheme has not been implemented here.
Read the README at GitHub.
The current state of the markup is:
<1> text division level 1 (chapter, say) <2> text division level 2 (subchapter, say) / new speaker (character) \ reporting clause (“speech marker”)
As a result, the opening of The Great Gatsby can be marked up as:
<1>The Great Gatsby Then wear the gold hat, if that will move her; If you can bounce high, bounce for her too, Till she cry "Lover, gold-hatted, high-bouncing lover, I must have you!" THOMAS PARKE D'INVILLIERS. <2>Chapter 1 In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since. /Mr. Carraway"Whenever you feel like criticizing any one," \he told me, /Mr. Carraway"just remember that all the people in this world haven't had the advantages that you've had." [...] <2>Chapter 2
Here, I have arbitrarily designated the novel itself as
level 1 of the text
division, thereby making each chapter
level 2. When Mr. Carraway speaks, his
speech is introduced with
/Mr. Carraway, and the reporting clause is marked
with a backslash. Every aspect of the markup, of course, is optional, so if you
want to keep the reporting clause as part of the narration, just don’t use the
backslash. If you want to skip dialogue by certain characters or in certain
parts, just don’t mark them up. There is a
on the GitHub project that is a bit longer than the example above.
The script expects dialog to take the form of
ascii double quotes (
dialogue"), though it also recognizes curly quotes (
“some dialogue”). It
" as the barrier that stops the name of the character
Mr. Carraway above). Other dialogue markers require some extra
There are no closing tags, because the script resets dialogue and reporting
clauses on blank lines. As a result, a paragraph gets broken up, as above.
Since, presumably, the text will be fed into
some other processing environment, the lack of paragraph integrity should not
be a cause for concern. Similarly, a new
<2> treats the previous one as
closed, much like
html does with
text_divider was designed to be quick to use, as I wanted simply to pull out
dialogue from novels with character attributes, without having to create a
whole TEI version of the novel. This markup, with some useful vim macros (see
the README) lets me markup about 100 pp of text in an hour or so, which is a
pretty quick way of building up something to feed the processor for