Andrew Gibiansky   ::   Math → [Code]

Homophony Groups in Haskell

Friday, January 17, 2014

A few days ago, a friend of mine sent me a fascinating problem. The problem goes like this:

The homophony group (of English) is the group with 26 generators a,b, c, and so on until z and one relation for every pair of English words which sound the same. Prove that the group is trivial!

For example, consider the group elements knight and night. By the cancellation laws, this implies that k must be the identity element. Recall that a trivial group is one which consists solely of its identity element, so our task is to show that each letter of the English alphabet is the identity element.

Skipping all of the algebraic jargon, we want to show that if we set all homophones "equal" to one another, and do left cancellation, right cancellation, and substitution, we can show that all the English letters equal one.

This is a fun exercise to do by hand, but I'd like to do it in Haskell.

Note: This work was done in IHaskell, and what you're reading is the IHaskell notebook exported to HTML for viewing in the browser. You can download the original notebook here.

I've started by compiling a list of homophones in American English, starting with this list and removing all single letters (such as j being a homophone with jay) and all words with apostrophes and periods, as well as some less commonly used words. You can download my list, or make your own.

The contents of the file look like this:

ad add
add ad
arc ark
ark arc

Each line is a space-delimited list of words. The first word in the list sounds identical to all the remaining words in the list. This is why you see repeats - ad sounds like add but also add sounds like ad. This repetition isn't necessary, as we could do it programmatically, but is convenient.

Let's go ahead and load this list:

In [1]:
import Control.Applicative ((<$>))
import Data.List.Utils     (split)

removeEmpty = filter (not . null)
homophones <- removeEmpty . map words . lines <$> readFile "homophones.list"

Let's take a look at a few more of these homophones.

In [2]:
import Control.Monad (forM_)
import Data.List     (intercalate)

-- Show ten of the homophone sets
forM_ (take 10 homophones) $ \ homs -> 
  putStrLn $ intercalate "\t" homs
adieu	ado
ado	adieu
affect	effect
aid	aide
aide	aid
ail	ale
air	err	heir
airs	errs	heirs
aisle	isle
ale	ail

Note that some of the sets have more than two elements, yet they are all on the same line.

Let's convert this into a more usable format. We'll define a new type WordPair which represents a single pair of homophones, and convert this list into a list of WordPairs.

In [3]:
data WordPair = WordPair String String
                deriving Show

-- Convert a list of homophones into a list of word pairs.
-- Note that the wordpairs should only use the first of the 
-- list as the first word, since there will be repeat sets. 
-- For instance, the set ["a", "b", "c"] would only generate 
-- word pairs [WordPair "a" "b", WordPair "a" "c"].
pairs :: [String] -> [WordPair]
pairs (str:strs) = map (WordPair str) strs

-- All pairs of words we consider homophones.
wordPairs = concatMap pairs homophones

Now that we have this data in a usable form, let's use it to derive relations.

The initial relations we have are simply the set of word pairs. However, we can use two operations in order to derive more relations:

  • reduce: The reduction operation will be the application of left and right cancellation laws. If a relation has the same thing on the left of both sides, we can take it off; same for the right side. This generates a new, simpler relation.
  • substitute: The substitution operation will be substituting identity relations in. For instance, if we've derived that d is the identity element, then we can remove d from all known relations to get new, simpler relations.

In addition to each relation storing what strings it considers equal, we'd also like to be able to track what operations led to the creation of that word pair. So before defining a relation, let's define a history data type:

In [4]:
data History = Reduce String String
             | Substitute Char
             deriving Show

Now, we'd like a relation to store all the transformations that were used to generate it, and also the two things it relates:

In [5]:
data Relation = Relation [History] String String deriving Show

-- We'd like equality to only look at the strings, not the history.
instance Eq Relation where
  Relation _ s1 s2 == Relation _ t1 t2 =
    s1 == t1 && s2 == t2

Since Relation and WordPair are slightly different, let's convert all our WordPairs to Relations. This gives us our initial set of relations, which we will use to derive all other relations.

In [6]:
toRelation :: WordPair -> Relation
toRelation (WordPair first second) = Relation [] first second

initRelations = map toRelation wordPairs

Eventually, we're going to iteratively improve these relations until we have proven that all letters equal the identity. First, though, let's define our two operators, starting with reduce.

When we reduce a relation, we apply the right and left cancellation laws. If we have the equation $$ab = ac$$ we can use the left cancellation law to reduce it to $b = c$; similarly, using the right cancellation law, we can reduce the equation $$xa = ya$$ to just $x = y$.

Our reduce operator repeats these steps until it can no longer do so, and then the resulting strings are the reduced relation.

In [7]:
reduce :: Relation -> Relation
reduce rel@(Relation hist first second)
  | canReduce first second = go (first, second)
  -- Note that we also have to be careful with the history.
  -- If the `reduce` does nothing, then we do not want to add
  -- anything to the history of the relation.
  | otherwise = rel
    -- A reduction can happen if both strings are non-zero
    -- and share a common first or last letter.
    canReduce first second =
      not (null first)  &&
      not (null second) &&
      (head first == head second ||
       last first == last second)
    -- Modified history including this reduction.
    hist' = Reduce first second : hist
    -- Base case: if we've reduced a word pair to an empty string 
    -- and something else, we're done, as that something else
    -- is equivalent to the identity element.
    go ("", word) = Relation hist' word ""
    go (word, "") = Relation hist' word "" 
    go (first, second)
      -- Chop off the first element if they're equal.
      | head first == head second
      = go (tail first, tail second)
      -- Chop off the last element if they're equal.
      | last first == last second
      = go (init first, init second)
      -- If netiher first nor last element are equal,
      -- we've simplified the relation down as much
      -- as we can simplify it.
      | otherwise =
          Relation hist' first second

This looks pretty good. Next, let's define the substitute operator.

The substitute operator removes a character from a relation. For instance, if we know that d is the identity, we can simplify the relation $$ad = dyd$$ to just $a = y$.

Just like the reduce operator, we avoid modifying the Relation's history if the substitute does nothing.

In [8]:
import Data.List.Utils (replace)

-- Generate a new relation by removing characters we know to be 
-- the identity. Make sure to update the history of the relation
-- with this substitution!
substitute :: Char -> Relation -> Relation
substitute char rel@(Relation hist first second)
  | canSubstitute first second
  = Relation (Substitute char : hist) (replaced first) (replaced second)
  | otherwise = rel
    canSubstitute first second = char `elem` first || char `elem` second
    replaced = replace [char] ""

With substitute implemented, we've finished all the machinery we're going to use for simplifying our relations. We're going to iteratively reduce and substitute until we've found that all the English letters are the identity element of the homophony group. We're still missing one thing, though - how do we know which letters we've proven to be the identity?

Let's define a quick helper datatype for every identity we find. We're going to store the character that we've proven is the identity, as well as the history; that way, when we want to examine the results, we can see exactly how each letter was reduced to the identity.

In [9]:
data FoundIdent = FoundIdent {
    char :: Char,
    hist :: [History]

Let's also define a function that extracts all the identity elements from a set of relations.

In [10]:
-- mapMaybe = map fromJust . filter isJust . map
import Data.Maybe (mapMaybe)

identities :: [Relation] -> [FoundIdent]
identities = mapMaybe go
    go :: Relation -> Maybe FoundIdent
    go (Relation hist [char] "") = Just $ FoundIdent char hist
    go (Relation hist "" [char]) = Just $ FoundIdent char hist
    go _ = Nothing

Let's finally put all of this together. We're going to start with our initial set of relations, initRelations, and then we're going to iteratively simplify them. Initially, we have no known identity elements.

In each iteration, we

  • Substitute into each relation each known identity (replacing it with the empty string).
  • Reduce the resulting relations.
  • Collect all known identity elements.
In [11]:
import Data.List     (nubBy)
import Data.Function (on)

-- The iteration starts with a list of known identity elements
-- and the current set of relations. It outputs the updated 
-- relations and all known identity elements.
iteration :: ([FoundIdent], [Relation]) -> ([FoundIdent], [Relation])
iteration (idents, relations) = (newIdents, newRelations)
    -- Collect all the substitutions into a single function.
    substitutions = foldl (.) id $ map (substitute . char) idents
    -- Do all substitutions, then reduce (for each relation).
    newRelations = map (reduce . substitutions) relations

    -- We have to remove duplicate identity elements, because
    -- in each iteration we find multiple ways to prove that some
    -- letters are the identity element. We just want one.
    removeDuplicateIdents =
      nubBy ((==) `on` char)

    -- Find all identities in the new relations.
    newIdents = removeDuplicateIdents $ idents ++ identities newRelations

Let's iterate this process until we have all the identities we want. We want 26 of them, so we can just check the length. (If this operation never finishes, we're out of luck!)

In [12]:
-- Generate the infinite list of iterations and their results.
initIdents = []
iterations = iterate iteration (initIdents, initRelations)

-- Define a completion condition.
-- We're done when there are 26 known identity elements.
done (idents, _) = length idents == 26

-- Discard all iteration results until completion.
-- Take the next one - the first one where the condition is met.
result = head $ dropWhile (not . done) iterations

Woohoo! We're done! Let's take a look at the results!

In [13]:
import Data.List (sort)

idents = fst result
identChars = map char idents
putStrLn $ sort identChars
length identChars

Looks like we do indeed have every single letter mapped to the identity.

Let's see if we can deduce, for each letter, how it was mapped to the identity. Instead of doing it in alphabetical order, we'll look at them in the order they were deduced, so it follows some logical flow.

In [14]:
import Text.Printf (printf)

forM_ idents $ \(FoundIdent char hist) -> do
  printf "Proving %c = 1:\n" char
  forM_ (reverse hist) $ \op ->
    putStrLn $ case op of
      Reduce first second -> 
        printf "Reduce %s and %s" first second
      Substitute ch ->
        printf "Substitute %c for ''" ch
  putStr "\n"
Proving e = 1:
Reduce aid and aide

Proving a = 1:
Reduce aisle and isle

Proving u = 1:
Reduce ant and aunt

Proving t = 1:
Reduce but and butt

Proving n = 1:
Reduce cannon and canon

Proving s = 1:
Reduce cent and scent

Proving h = 1:
Reduce choral and coral

Proving k = 1:
Reduce doc and dock

Proving l = 1:
Reduce filet and fillet

Proving w = 1:
Reduce hole and whole

Proving b = 1:
Reduce plum and plumb

Proving g = 1:
Reduce reign and rein

Proving c = 1:
Reduce scent and sent

Proving o = 1:
Reduce to and too

Proving i = 1:
Reduce waive and wave

Proving r = 1:
Reduce air and err
Substitute i for ''
Substitute a for ''
Substitute e for ''

Proving d = 1:
Reduce awed and odd
Substitute o for ''
Substitute w for ''
Substitute a for ''
Substitute e for ''

Proving y = 1:
Reduce bite and byte
Substitute i for ''

Proving z = 1:
Reduce boos and booze
Substitute s for ''
Substitute e for ''

Proving q = 1:
Reduce cask and casque
Substitute k for ''
Substitute u for ''
Substitute e for ''

Proving x = 1:
Reduce coax and cokes
Substitute k for ''
Substitute s for ''
Substitute a for ''
Substitute e for ''

Proving p = 1:
Reduce coo and coup
Substitute o for ''
Substitute u for ''

Proving f = 1:
Reduce draft and draught
Substitute g for ''
Substitute h for ''
Substitute u for ''

Proving m = 1:
Reduce damned and dammed
Substitute n for ''

Proving j = 1:
Reduce genes and jeans
Substitute g for ''
Substitute n for ''
Substitute a for ''
Substitute e for ''

Proving v = 1:
Reduce felt and veldt
Substitute l for ''
Substitute e for ''
Substitute f for ''
Substitute d for ''

If you scan through the list above, there's a few weird cases, but for the most part, it seems legitimate. (I mildly question felt and veldt, but it depends on how you pronounce things. If you look at the British English list of homophones, it's totally different anyways!)

So that's that! We've found the ways to reduce every letter to the identity, and shown how to do it.

I wonder if other languages also have trivial homophony groups. It might be fun to try Spanish, French, Russian, and others, and see if the homophony groups tell us anything interesting about the language!