{
"metadata": {
"language": "haskell",
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A few days ago, a friend of mine sent me [a fascinating problem](http://math.ucsd.edu/~justin/190hw.html). The problem goes like this:\n",
"\n",
"> The *homophony group* (of English) is the group with 26 generators `a`,`b`, `c`, and so on until `z` and one relation for every pair of English words which sound the same. Prove that the group is trivial!\n",
"\n",
"For example, consider the group elements **knight** and **night**. By the [cancellation laws](http://www.proofwiki.org/wiki/Cancellation_Laws), this implies that **k** must be the identity element. Recall that a trivial group is one which consists solely of its identity element, so our task is to show that each letter of the English alphabet is the identity element.\n",
"\n",
"Skipping all of the algebraic jargon, we want to show that if we set all homophones \"equal\" to one another, and do left cancellation, right cancellation, and substitution, we can show that all the English letters equal one.\n",
"\n",
"This is a fun exercise to do by hand, but I'd like to do it in Haskell. \n",
"\n",
"**Note: This work was done in [IHaskell](https://github.com/gibiansky/IHaskell), and what you're reading is the IHaskell notebook exported to HTML for viewing in the browser. You can download the original notebook [here](/blog/linguistics/homophony-groups/data/Homophony-Groups.ipynb).**\n",
"\n",
"I've started by compiling a list of homophones in American English, starting with [this list](http://members.peak.org/~jeremy/dictionaryclassic/chapters/homophones.php) and removing all single letters (such as `j` being a homophone with `jay`) and all words with apostrophes and periods, as well as some less commonly used words. You can [download my list](/blog/linguistics/homophony-groups/data/homophones.list), or make your own.\n",
"\n",
"The contents of the file look like this:\n",
"```\n",
"ad add\n",
"add ad\n",
"arc ark\n",
"ark arc\n",
"...\n",
"```\n",
"\n",
"Each line is a space-delimited list of words. The first word in the list sounds identical to all the remaining words in the list. This is why you see repeats - `ad` sounds like `add` but also `add` sounds like `ad`. This repetition isn't necessary, as we could do it programmatically, but is convenient.\n",
"\n",
"Let's go ahead and load this list:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import Control.Applicative ((<$>))\n",
"import Data.List.Utils (split)\n",
"\n",
"removeEmpty = filter (not . null)\n",
"homophones <- removeEmpty . map words . lines <$> readFile \"homophones.list\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at a few more of these homophones."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import Control.Monad (forM_)\n",
"import Data.List (intercalate)\n",
"\n",
"-- Show ten of the homophone sets\n",
"forM_ (take 10 homophones) $ \\ homs -> \n",
" putStrLn $ intercalate \"\\t\" homs"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"text": [
"adieu\tado\n",
"ado\tadieu\n",
"affect\teffect\n",
"aid\taide\n",
"aide\taid\n",
"ail\tale\n",
"air\terr\their\n",
"airs\terrs\theirs\n",
"aisle\tisle\n",
"ale\tail"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that some of the sets have more than two elements, yet they are all on the same line.\n",
"\n",
"Let's convert this into a more usable format. We'll define a new type `WordPair` which represents a *single pair* of homophones, and convert this list into a list of `WordPair`s."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data WordPair = WordPair String String\n",
" deriving Show\n",
"\n",
"-- Convert a list of homophones into a list of word pairs.\n",
"-- Note that the wordpairs should only use the first of the \n",
"-- list as the first word, since there will be repeat sets. \n",
"-- For instance, the set [\"a\", \"b\", \"c\"] would only generate \n",
"-- word pairs [WordPair \"a\" \"b\", WordPair \"a\" \"c\"].\n",
"pairs :: [String] -> [WordPair]\n",
"pairs (str:strs) = map (WordPair str) strs\n",
"\n",
"-- All pairs of words we consider homophones.\n",
"wordPairs = concatMap pairs homophones"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have this data in a usable form, let's use it to derive relations. \n",
"\n",
"The initial relations we have are simply the set of word pairs. However, we can use two operations in order to derive more relations:\n",
"\n",
"- `reduce`: The reduction operation will be the application of left and right cancellation laws. If a relation has the same thing on the left of both sides, we can take it off; same for the right side. This generates a new, simpler relation.\n",
"- `substitute`: The substitution operation will be substituting identity relations in. For instance, if we've derived that `d` is the identity element, then we can remove `d` from all known relations to get new, simpler relations.\n",
"\n",
"In addition to each relation storing what strings it considers equal, we'd also like to be able to track what operations led to the creation of that word pair. So before defining a relation, let's define a history data type:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data History = Reduce String String\n",
" | Substitute Char\n",
" deriving Show"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we'd like a relation to store all the transformations that were used to generate it, and also the two things it relates:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data Relation = Relation [History] String String deriving Show\n",
"\n",
"-- We'd like equality to only look at the strings, not the history.\n",
"instance Eq Relation where\n",
" Relation _ s1 s2 == Relation _ t1 t2 =\n",
" s1 == t1 && s2 == t2"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `Relation` and `WordPair` are slightly different, let's convert all our `WordPair`s to `Relation`s. This gives us our initial set of relations, which we will use to derive all other relations."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"toRelation :: WordPair -> Relation\n",
"toRelation (WordPair first second) = Relation [] first second\n",
"\n",
"initRelations = map toRelation wordPairs"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Eventually, we're going to iteratively improve these relations until we have proven that all letters equal the identity. First, though, let's define our two operators, starting with `reduce`.\n",
"\n",
"When we `reduce` a relation, we apply the right and left cancellation laws. If we have the equation\n",
"$$ab = ac$$\n",
"we can use the left cancellation law to reduce it to $b = c$; similarly, using the right cancellation law, we can reduce the equation \n",
"$$xa = ya$$\n",
"to just $x = y$.\n",
"\n",
"Our `reduce` operator repeats these steps until it can no longer do so, and then the resulting strings are the reduced relation.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"reduce :: Relation -> Relation\n",
"reduce rel@(Relation hist first second)\n",
" | canReduce first second = go (first, second)\n",
" \n",
" -- Note that we also have to be careful with the history.\n",
" -- If the `reduce` does nothing, then we do not want to add\n",
" -- anything to the history of the relation.\n",
" | otherwise = rel\n",
" \n",
" where\n",
" -- A reduction can happen if both strings are non-zero\n",
" -- and share a common first or last letter.\n",
" canReduce first second =\n",
" not (null first) &&\n",
" not (null second) &&\n",
" (head first == head second ||\n",
" last first == last second)\n",
" \n",
" -- Modified history including this reduction.\n",
" hist' = Reduce first second : hist\n",
" \n",
" -- Base case: if we've reduced a word pair to an empty string \n",
" -- and something else, we're done, as that something else\n",
" -- is equivalent to the identity element.\n",
" go (\"\", word) = Relation hist' word \"\"\n",
" go (word, \"\") = Relation hist' word \"\" \n",
" \n",
" go (first, second)\n",
" -- Chop off the first element if they're equal.\n",
" | head first == head second\n",
" = go (tail first, tail second)\n",
" \n",
" -- Chop off the last element if they're equal.\n",
" | last first == last second\n",
" = go (init first, init second)\n",
" \n",
" -- If netiher first nor last element are equal,\n",
" -- we've simplified the relation down as much\n",
" -- as we can simplify it.\n",
" | otherwise =\n",
" Relation hist' first second"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This looks pretty good. Next, let's define the `substitute` operator.\n",
"\n",
"The `substitute` operator removes a character from a relation. For instance, if we know that `d` is the identity, we can simplify the relation $$ad = dyd$$ to just $a = y$. \n",
"\n",
"Just like the `reduce` operator, we avoid modifying the `Relation`'s history if the `substitute` does nothing."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import Data.List.Utils (replace)\n",
"\n",
"-- Generate a new relation by removing characters we know to be \n",
"-- the identity. Make sure to update the history of the relation\n",
"-- with this substitution!\n",
"substitute :: Char -> Relation -> Relation\n",
"substitute char rel@(Relation hist first second)\n",
" | canSubstitute first second\n",
" = Relation (Substitute char : hist) (replaced first) (replaced second)\n",
" \n",
" | otherwise = rel\n",
" where\n",
" canSubstitute first second = char `elem` first || char `elem` second\n",
" replaced = replace [char] \"\""
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `substitute` implemented, we've finished all the machinery we're going to use for simplifying our relations. We're going to iteratively reduce and substitute until we've found that all the English letters are the identity element of the homophony group. We're still missing one thing, though - how do we know which letters we've proven to be the identity?\n",
"\n",
"Let's define a quick helper datatype for every identity we find. We're going to store the character that we've proven is the identity, as well as the history; that way, when we want to examine the results, we can see exactly how each letter was reduced to the identity."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data FoundIdent = FoundIdent {\n",
" char :: Char,\n",
" hist :: [History]\n",
" }"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also define a function that extracts all the identity elements from a set of relations."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"-- mapMaybe = map fromJust . filter isJust . map\n",
"import Data.Maybe (mapMaybe)\n",
"\n",
"identities :: [Relation] -> [FoundIdent]\n",
"identities = mapMaybe go\n",
" where\n",
" go :: Relation -> Maybe FoundIdent\n",
" go (Relation hist [char] \"\") = Just $ FoundIdent char hist\n",
" go (Relation hist \"\" [char]) = Just $ FoundIdent char hist\n",
" go _ = Nothing"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's finally put all of this together. We're going to start with our initial set of relations, `initRelations`, and then we're going to iteratively simplify them. Initially, we have no known identity elements.\n",
"\n",
"In each iteration, we\n",
"\n",
"- Substitute into each relation each known identity (replacing it with the empty string).\n",
"- Reduce the resulting relations.\n",
"- Collect all known identity elements."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import Data.List (nubBy)\n",
"import Data.Function (on)\n",
"\n",
"-- The iteration starts with a list of known identity elements\n",
"-- and the current set of relations. It outputs the updated \n",
"-- relations and all known identity elements.\n",
"iteration :: ([FoundIdent], [Relation]) -> ([FoundIdent], [Relation])\n",
"iteration (idents, relations) = (newIdents, newRelations)\n",
" where\n",
" -- Collect all the substitutions into a single function.\n",
" substitutions = foldl (.) id $ map (substitute . char) idents\n",
" \n",
" -- Do all substitutions, then reduce (for each relation).\n",
" newRelations = map (reduce . substitutions) relations\n",
"\n",
" -- We have to remove duplicate identity elements, because\n",
" -- in each iteration we find multiple ways to prove that some\n",
" -- letters are the identity element. We just want one.\n",
" removeDuplicateIdents =\n",
" nubBy ((==) `on` char)\n",
"\n",
" -- Find all identities in the new relations.\n",
" newIdents = removeDuplicateIdents $ idents ++ identities newRelations"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's iterate this process until we have all the identities we want. We want 26 of them, so we can just check the length. (If this operation never finishes, we're out of luck!)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"-- Generate the infinite list of iterations and their results.\n",
"initIdents = []\n",
"iterations = iterate iteration (initIdents, initRelations)\n",
"\n",
"-- Define a completion condition.\n",
"-- We're done when there are 26 known identity elements.\n",
"done (idents, _) = length idents == 26\n",
"\n",
"-- Discard all iteration results until completion.\n",
"-- Take the next one - the first one where the condition is met.\n",
"result = head $ dropWhile (not . done) iterations"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Woohoo! We're *done*! Let's take a look at the results!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import Data.List (sort)\n",
"\n",
"idents = fst result\n",
"identChars = map char idents\n",
"putStrLn $ sort identChars\n",
"length identChars"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"text": [
"abcdefghijklmnopqrstuvwxyz"
]
},
{
"metadata": {},
"output_type": "display_data",
"text": [
"26"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks like we do indeed have every single letter mapped to the identity. \n",
"\n",
"Let's see if we can deduce, for each letter, how it was mapped to the identity. Instead of doing it in alphabetical order, we'll look at them in the order they were deduced, so it follows some logical flow."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import Text.Printf (printf)\n",
"\n",
"forM_ idents $ \\(FoundIdent char hist) -> do\n",
" printf \"Proving %c = 1:\\n\" char\n",
" forM_ (reverse hist) $ \\op ->\n",
" putStrLn $ case op of\n",
" Reduce first second -> \n",
" printf \"Reduce %s and %s\" first second\n",
" Substitute ch ->\n",
" printf \"Substitute %c for ''\" ch\n",
" putStr \"\\n\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"text": [
"Proving e = 1:\n",
"Reduce aid and aide\n",
"\n",
"Proving a = 1:\n",
"Reduce aisle and isle\n",
"\n",
"Proving u = 1:\n",
"Reduce ant and aunt\n",
"\n",
"Proving t = 1:\n",
"Reduce but and butt\n",
"\n",
"Proving n = 1:\n",
"Reduce cannon and canon\n",
"\n",
"Proving s = 1:\n",
"Reduce cent and scent\n",
"\n",
"Proving h = 1:\n",
"Reduce choral and coral\n",
"\n",
"Proving k = 1:\n",
"Reduce doc and dock\n",
"\n",
"Proving l = 1:\n",
"Reduce filet and fillet\n",
"\n",
"Proving w = 1:\n",
"Reduce hole and whole\n",
"\n",
"Proving b = 1:\n",
"Reduce plum and plumb\n",
"\n",
"Proving g = 1:\n",
"Reduce reign and rein\n",
"\n",
"Proving c = 1:\n",
"Reduce scent and sent\n",
"\n",
"Proving o = 1:\n",
"Reduce to and too\n",
"\n",
"Proving i = 1:\n",
"Reduce waive and wave\n",
"\n",
"Proving r = 1:\n",
"Reduce air and err\n",
"Substitute i for ''\n",
"Substitute a for ''\n",
"Substitute e for ''\n",
"\n",
"Proving d = 1:\n",
"Reduce awed and odd\n",
"Substitute o for ''\n",
"Substitute w for ''\n",
"Substitute a for ''\n",
"Substitute e for ''\n",
"\n",
"Proving y = 1:\n",
"Reduce bite and byte\n",
"Substitute i for ''\n",
"\n",
"Proving z = 1:\n",
"Reduce boos and booze\n",
"Substitute s for ''\n",
"Substitute e for ''\n",
"\n",
"Proving q = 1:\n",
"Reduce cask and casque\n",
"Substitute k for ''\n",
"Substitute u for ''\n",
"Substitute e for ''\n",
"\n",
"Proving x = 1:\n",
"Reduce coax and cokes\n",
"Substitute k for ''\n",
"Substitute s for ''\n",
"Substitute a for ''\n",
"Substitute e for ''\n",
"\n",
"Proving p = 1:\n",
"Reduce coo and coup\n",
"Substitute o for ''\n",
"Substitute u for ''\n",
"\n",
"Proving f = 1:\n",
"Reduce draft and draught\n",
"Substitute g for ''\n",
"Substitute h for ''\n",
"Substitute u for ''\n",
"\n",
"Proving m = 1:\n",
"Reduce damned and dammed\n",
"Substitute n for ''\n",
"\n",
"Proving j = 1:\n",
"Reduce genes and jeans\n",
"Substitute g for ''\n",
"Substitute n for ''\n",
"Substitute a for ''\n",
"Substitute e for ''\n",
"\n",
"Proving v = 1:\n",
"Reduce felt and veldt\n",
"Substitute l for ''\n",
"Substitute e for ''\n",
"Substitute f for ''\n",
"Substitute d for ''"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you scan through the list above, there's a few weird cases, but for the most part, it seems legitimate. (I mildly question `felt` and `veldt`, but it depends on how you pronounce things. If you look at the British English list of homophones, it's totally different anyways!) \n",
"\n",
"So that's that! We've found the ways to reduce every letter to the identity, and shown how to do it.\n",
"\n",
"I wonder if other languages also have trivial homophony groups. It might be fun to try Spanish, French, Russian, and others, and see if the homophony groups tell us anything interesting about the language!"
]
}
],
"metadata": {}
}
]
}