This page walks you through marking up "The Voyage of the Beagle" by Charles Darwin. You may want to download the e-text and follow along. Here are some general tips, all of which are really common sense.
Download the document and save it. Make sure you have your text editor and XML parser handy. The operation is described here using EditPad and IE5 (see "Tools of the Trade").
Step1: Open the text file vbgle10.txt in your text editor and save it as vbgle10.xml.
We will want to check at regular intervals that our document is well formed. The best way to do this is to load it with an XML MIME type rather than an HTML MIME type. Error checking is much more thorough in an XML browser, and we don't want any errors! To check for well-formedness errors we just have to open the document in IE5, and the check is carried out automatically.
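If IE5 is not to hand, the same kind of check can be approximated with any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as a stand-in for the browser; the function name check_well_formed is made up for this illustration and is not part of the original workflow.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return None if xml_text is well formed, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple supplied by the parser.
        line, column = err.position
        return line, column, str(err)
    return None

# A bare ampersand is enough to make a document ill formed.
print(check_well_formed("<p>Tom &amp; Jerry</p>"))
print(check_well_formed("<p>Tom & Jerry</p>"))
```

As with IE5, the reported line is where the parser noticed the problem, which may be later than the line where the mistake was made.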
Step2: Search for & and < characters. Replace these with their entities, &amp; and &lt;
The & and < characters have special meaning in XML, so we need to replace them with their entities, &amp; and &lt;. Replace & first; otherwise the & in each newly inserted &lt; would itself be converted. The easiest way to do this is with EditPad's search and replace function.
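The same replacement can be scripted. This is a small sketch in Python rather than EditPad; escape_markup is a hypothetical helper name, and the sample sentence is invented for the demonstration.

```python
def escape_markup(text):
    """Replace the two characters that are special in XML text content."""
    # The ampersand must be handled first: doing "<" first would create
    # "&lt;" sequences whose "&" would then be escaped a second time.
    return text.replace("&", "&amp;").replace("<", "&lt;")

sample = "Fish & chips cost < 5 shillings"
print(escape_markup(sample))  # Fish &amp; chips cost &lt; 5 shillings
```

Run it once only, on the raw text: a second pass would turn each &amp; into &amp;amp;.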
Step3: Look for consecutive new lines (use \n\n in Edit text) and replace them with \n</p>\n<p>\n
This will divide the document up into paragraph elements. You may find that there are 'empty' <p></p> elements. Find these with the search and replace function and get rid of them. Of course, the document will now start with a </p> and end with a <p>, so we need to fix this:
Step4: Move the final <p> to the beginning.
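The paragraph steps above can be sketched as a short Python function. It is an illustration under assumptions, not the author's method: mark_paragraphs is a hypothetical name, and instead of literally moving the trailing <p> it strips leading and trailing newlines and adds the opening and closing tags directly, which gives the same balanced result.

```python
import re

def mark_paragraphs(text):
    """Wrap blank-line-separated blocks of text in <p> elements."""
    # Ignore blank lines at the very start and end of the file.
    body = text.strip("\n")
    # Each pair of consecutive newlines becomes a paragraph boundary.
    marked = body.replace("\n\n", "\n</p>\n<p>\n")
    # Balance the ends: open the first paragraph and close the last one.
    marked = "<p>\n" + marked + "\n</p>\n"
    # Runs of several blank lines leave empty paragraphs behind; remove them.
    marked = re.sub(r"<p>\s*</p>\n", "", marked)
    return marked

print(mark_paragraphs("First paragraph\nstill first\n\n\n\nSecond paragraph\n"))
```

An odd number of consecutive newlines leaves a stray blank line inside a paragraph, which is harmless for well-formedness.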
This is not yet a well-formed XML document, because there is no root element that encloses the whole document (<html> in XHTML), so we need to fix this now.
Step5: Add the following to the front of the text
<html> <head> <title>HWG Gutenberg The Voyage of the Beagle by Darwin </title> </head> <body>
and add this to the end
</body> </html>
We now have a well-formed XHTML file! Open it in IE5 as an XML file to test for well-formedness.
If there are any errors, fix them (there shouldn't be any at this stage!). IE5 will tell you the nature of the error and the line number where the error became apparent. Remember that this may not be the line that contains the error; the error may be several lines back!
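To see why the root element matters, here is a hedged sketch that wraps a couple of paragraphs in the header and footer from Step5 and parses the result. It assumes Python's xml.etree.ElementTree in place of IE5; HEADER, FOOTER and wrap_in_root are names invented for this example, and the two paragraphs are placeholder text.

```python
import xml.etree.ElementTree as ET

HEADER = (
    "<html>\n<head>\n"
    "<title>HWG Gutenberg The Voyage of the Beagle by Darwin </title>\n"
    "</head>\n<body>\n"
)
FOOTER = "</body>\n</html>\n"

def wrap_in_root(body):
    """Enclose the marked-up paragraphs in a single <html> root element."""
    return HEADER + body + FOOTER

body = "<p>\nFirst paragraph\n</p>\n<p>\nSecond paragraph\n</p>\n"

# Without a root element the parser rejects the second <p>.
try:
    ET.fromstring(body)
    parsed_without_root = True
except ET.ParseError:
    parsed_without_root = False

# With the wrapper in place the same content parses cleanly.
root = ET.fromstring(wrap_in_root(body))
print(parsed_without_root, root.tag, len(root.find("body")))
```

The parse only confirms well-formedness; it does not validate the file against an XHTML DTD.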
Many proprietary HTML editors can save a text document as HTML. We could then use HTML Tidy to convert the document to XML. However, many of these tools have difficulty handling large text files (for example, vbgle10.txt freezes Word 97), and there is often so much 'junk' in the document that needs to be pruned out that it is not worth the effort. However, by all means go ahead and experiment using your favorite HTML editor!
Step1: Open up the text vbgle10.txt file in your text editor and save it as vbgle10.xml
Step2: Search for & and < characters. Replace these with their entities, &amp; and &lt;
Step3: Look for consecutive new lines (use \n\n in Edit text) and replace them with \n</p>\n<p>\n
Step4: Move the final <p> to the beginning
Step5: Add the following to the front of the text
<html>
<head>
<title>HWG Gutenberg The Voyage of the Beagle by Darwin </title>
</head>
<body>
<p>
Project Gutenberg's Etext of The Voyage of the Beagle by Darwin
#1 in our series by Charles Darwin
and add this to the end
End of Project Gutenberg's Etext of The Voyage of the Beagle by Darwin
</p>
</body>
</html>
We now have a well-formed XHTML file! Open it in IE5 as an XML file to test for well-formedness. If there are any errors, fix them (there shouldn't be any at this stage!).
Step6: Add the major division classes
<div class="gutblurb">
<div class="revhist">
<div class="book">
<div class="frontmatter">
<div class="bookbody">
<div class="backmatter">
<div class="endgutblurb">
As you add each one, check the document for well-formedness in IE5.
Stepx: A quick perusal shows that each chapter contains the word CHAPTER *. Use the 'find' function of your text editor to isolate the start of each chapter and mark it up. We suggest that you first change the 'p' elements to something more suitable, and then add the 'div class=' elements. The last one to add is the class="chapter". Once this is added, proceed to the next chapter, making sure that you add the closing tag.
Here is what the markup looks like after we have finished marking up the beginning of chapter 1
<div class="bookbody">
<div class="chapter">
<div class="chapnumber">
<h2> CHAPTER I </h2>
</div><!--end of class="chapnumber"-->
<div class="chaptitle">
<h3> ST. JAGO -- CAPE DE VERD ISLANDS </h3>
</div><!--end of class="chaptitle"-->
<div class="chapsummary">
<p>
Porto Praya -- Ribeira Grande -- Atmospheric Dust with Infusoria -- Habits of a Sea-slug and Cuttle-fish -- St. Paul's Rocks, non-volcanic -- Singular Incrustations -- Insects the first Colonists of Islands -- Fernando Noronha -- Bahia -- Burnished Rocks -- Habits of a Diodon -- Pelagic Confervae and Infusoria -- Causes of discoloured Sea.
</p>
</div><!--end of class="chapsummary"-->
Check for well-formedness by loading it into IE5 before proceeding! Note that we have added an additional class, chapsummary. We need to record this.
Stepx: We decide to add a new class 'chapsummary' to enclose the summary at the beginning of each chapter. We must thus include it in the revhist section as below.
<div class="revhist">
<pre>
Initial Marker Frank Boumphrey
email firstname.lastname@example.org
Date: 1/14/00
new classes
chapsummary: the summary of a chapter coming after the title
</pre>
</div>
Stepx: You may find it easier (I do) to use the search function to step through all the chapters and mark up each piece individually, i.e. first do all the class="chapter", then all the class="chapnumber", etc.
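Stepping through the chapters one class at a time also lends itself to a script. The sketch below is an illustration only: it assumes every chapter heading sits alone in its own <p> element as the word CHAPTER followed by a Roman numeral, which should be checked against the actual e-text, and the names find_chapters and mark_chapter_numbers are hypothetical.

```python
import re

# A heading on a line of its own, e.g. "CHAPTER IV".
HEADING_LINE = re.compile(r"^[ \t]*(CHAPTER [IVXLC]+)[ \t]*$", re.MULTILINE)
# The same heading together with the <p> element that Step3 put around it.
HEADING_PARA = re.compile(r"<p>\s*(CHAPTER [IVXLC]+)\s*</p>")

def find_chapters(text):
    """Return (line_number, heading) pairs so each chapter can be visited."""
    return [(text.count("\n", 0, m.start(1)) + 1, m.group(1))
            for m in HEADING_LINE.finditer(text)]

def mark_chapter_numbers(text):
    """Replace each heading paragraph with the chapnumber markup."""
    def wrap(match):
        return ('<div class="chapnumber">\n<h2> ' + match.group(1) +
                ' </h2>\n</div><!--end of class="chapnumber"-->')
    return HEADING_PARA.sub(wrap, text)

sample = "<p>\nCHAPTER I\n</p>\n<p>\nST. JAGO\n</p>\n<p>\nCHAPTER II\n</p>\n"
chapters = find_chapters(sample)
marked = mark_chapter_numbers(sample)
print(chapters)
print(marked)
```

This covers only the class="chapnumber" pass; the chaptitle, chapsummary and enclosing chapter divisions still need to be added and checked by hand or with similar passes.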