Step 0: Encryption
Obtain an English plaintext message of 50,000 plaintext characters, where the characters consist only of lowercase a through z (i.e., remove all punctuation, special characters, and spaces, and convert all upper case to lower case). Set a key yourself and encrypt this plaintext using a simple substitution cipher.
Step 1: Generate a digraph frequency matrix A for English text
Use an English text with 1,000,000 characters (only 26 letters in lowercases) to compute a digraph frequency matrix A, where aij is the count of the number of times that letter i is followed by letter j. Here, we assume that a is letter 0, b is letter 1, c is letter 2, and so on. Next, add five to each element in your 26×26 matrix A. Finally, normalize your matrix A by dividing each element by its row sum. The resulting matrix A will be row stochastic, and it will not contain any 0 probabilities.
Step 2: Train an HMM with M = N = 26
Using the first 1000 characters of ciphertext you generated in step 0 as O to train an HMM, where the A matrix is initialized with your A matrix from step 1. That is, in your HMM, do not re-estimate A. For B & pi, randomly initialize them so each element is about 1/26 (recall they also need to be row stochastic! and do NOT make all elements the same). Train using 200 iterations of the Baum-Welch re-estimation algorithm.
Step 3: Analyze Result
Use the final B matrix you get from step 2 to determine a putative key and give the fraction of putative key elements that match the actual key (as a decimal, to four places). For example, if 22 of the 26 key positions are correct, then your answer would be 22/26 = 0.8462. If the accuracy is low (<80%), you can do several tries.
- Your source code (in any language you like)
- A report in .doc or .pdf that includes screenshot(s) of the result, and your analysis. The format doesn’t matter, just make sure the screenshots are clear and easy to read.