UNSCRAMBLING CODES:
FROM HIEROGLYPHS TO MARKET NEWS
ABSTRACT
This paper reviews some of the steps that paved the way for the development of sentiment analysis (or
opinion mining), a technique apparently used by Jim Simons’ Medallion fund to achieve an ‘impossible’
performance: a 66% average annual rate of return over the 31 years from 1988 to 2018. Sentiment
analysis is a powerful tool that uses natural language processing (NLP), or computational linguistics, to
determine whether a text about a company is positive, negative or neutral and, ultimately, to discover
stock price patterns. Humans have always used symbols to communicate, plainly or secretively.
Here we review some of the methods used over the past centuries, including the Egyptians’ hieroglyphs,
Julius Caesar’s cipher, Fibonacci’s abbreviations, Leonardo da Vinci’s mirror writing and Mary Stuart’s code.
The intention is to describe some stages of the long journey made by human beings to arrive at the
current sophisticated IT tools for sentiment analysis.
KEYWORDS
Natural Language Processing, Hedge Funds, Cryptography
1. INTRODUCTION
What do Jean-François Champollion, French archaeologist and Egyptologist, and James (Jim)
Harris Simons, mathematician and hedge fund manager, have in common (Figure 1)? They have
both been code breakers. Champollion was able to unscramble the message encrypted in the
Rosetta stone. He had studied the Coptic language, which derived from ancient Egyptian, and this
knowledge made it possible for him to understand the Demotic section of the Rosetta stone. Jim
Simons received a bachelor’s degree in mathematics from MIT and a PhD in mathematics from UC
Berkeley, and held a research position at Harvard. He later became a breaker of Russian codes at
the Institute for Defense Analyses (IDA), during the Cold War.
John Hull [1] cites Jim Simons in the chapter on Natural Language Processing (NLP) of his
book on Machine Learning in Business:
New data sources are becoming available all the time. One approach is to try and
be one step ahead of most others in exploiting these new data sources. Another is
to develop better models than those being used by others and then be very secretive
about it. Renaissance Technologies, a hedge fund, provides an example of the se-
cond approach. It has been amazingly successful at using sophisticated models to
understand stock price patterns. Other hedge funds have been unable to replicate
its success. The average return of its flagship Medallion fund between 1988 and
2018 was 66% per year before fees. This included a return of close to 100% in
2008 when the S&P 500 lost 38.5%. Two senior executives, Robert Mercer and Pe-
ter Brown, are NLP experts and have been running the company following the re-
tirement of the founder, Jim Simons, in 2009. (p. 197)
DOI: 10.5121/ijnlc.2022.11603
International Journal on Natural Language Computing (IJNLC) Vol.11, No.6, December 2022
Source: Wikipedia
Figure 1 Jean-François Champollion and James (Jim) Harris Simons.
2. MEDALLION FUND
Gregory Zuckerman [2], in his bestselling book The Man Who Solved the Market, reports that
Simons quit Harvard in 1964 to join the Institute for Defense Analyses (IDA), “an elite research
organization that hired mathematicians from top universities to assist the National Security
Agency − the United States’ largest and most secretive intelligence agency − in detecting and
attacking Russian codes and ciphers. ... The IDA taught Simons how to develop mathematical
models to discern and interpret patterns in seemingly meaningless data.” (pp. 23-4).
While at IDA, Simons co-authored a paper on “Probabilistic Models for (and Prediction
of) Stock Market Behavior”. Detecting stock price patterns became one of his aims. However,
he apparently did not care about their interpretation:
“I don’t know why planets orbit the sun,” Simons told a colleague, suggesting one
needn’t spend too much time figuring out why the market’s patterns existed. “That
doesn’t mean I can’t predict them.” {[2], p. 151}
Whatever techniques Jim Simons used, his results are astonishing (Appendix A):
There were compelling reasons I was determined to tell Simons’s story. A former
math professor, Simons is arguably the most successful trader in the history of
modern finance. Since 1988, Renaissance’s flagship Medallion hedge fund has
generated average annual returns of 66 percent, racking up trading profits of more
than $100 billion (see Appendix 1 for how I arrive at these numbers). No one in the
investment world comes close. Warren Buffett, George Soros, Peter Lynch, Steve
Cohen, and Ray Dalio all fall short (see Appendix 2). {[2], p. xvi}
Gross log-returns consistent with the data of Appendix A are reported in Table 1, together with
the corresponding data for the S&P 500 Total Return Index. Before fees, the Medallion Fund (MF)
outperformed the S&P 500 in 28 out of 31 years and never reported a loss. A $1 amount invested in
the Medallion Fund at the end of 1987 grew, before fees, to $3,995,683 after 31 years!
Figure 2 reports the performance distributions of both investments (MF and S&P 500).
The average log-return, μ, of the Medallion Fund (49.0%) is 5 times greater than the average
log-return of the S&P 500 (9.7%), while the standard deviations, σ, are similar (18.7% and
17.0%, respectively). It will be impossible to beat this incredible performance!
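The Medallion figures quoted above can be checked directly from the "Returns Before Fees" column of Appendix A. The following is a minimal sketch, assuming that log-returns are computed as ln(1 + r) and that σ is the sample standard deviation (n − 1 in the denominator); the variable names are illustrative and not taken from the paper.

```python
import math
import statistics

# Gross returns before fees, 1988-2018, copied from Appendix A (in percent).
GROSS_RETURNS_PCT = [
    16.3, 1.0, 77.8, 54.3, 47.0, 53.9, 93.4, 52.9, 44.4, 31.5,
    57.1, 35.6, 128.1, 56.6, 51.1, 44.1, 49.5, 57.7, 84.1, 136.6,
    152.1, 74.6, 57.5, 71.1, 56.8, 88.8, 75.0, 69.3, 68.6, 85.4, 76.4,
]

# Assumption: the log-return of a year with simple return r is ln(1 + r).
log_returns = [math.log(1 + r / 100) for r in GROSS_RETURNS_PCT]

arith_mean_pct = statistics.mean(GROSS_RETURNS_PCT)  # average simple return
mean_log = statistics.mean(log_returns)              # average log-return (mu)
sd_log = statistics.stdev(log_returns)               # sample std. dev. (sigma)
final_wealth = math.exp(sum(log_returns))            # value of $1 after 31 years

print(f"average simple return: {arith_mean_pct:.1f}%")
print(f"average log-return:    {mean_log:.1%}")
print(f"std. dev. of log-ret.: {sd_log:.1%}")
print(f"$1 grows to:           ${final_wealth:,.0f}")
```

Because the appendix returns are rounded to one decimal place, the compounded value may differ slightly from the dollar amount given in the text.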
Table 1 Medallion Fund vs. S&P 500 Total Return Index: gross log-returns per year.
Figure 2 S&P 500 Total Return vs. Medallion Fund: number of observations (left axis) and normal densities N(9.7%, 17.0%) and N(49.0%, 18.7%) (right axis).
16, 2020 (-12.0%, from 2,711.02 to 2,386.13) are called “6-sigma” or “six standard deviation”
events. Actually, if X is a standardized normal variable, the probability of X > 6.4
[=NORM.S.INV(1-1E-10) in Excel™] is equal to 0.0000000001, i.e. 1 out of 10 billion.
What is the probability of earning a 49.0% average annual log-return for 31 consecutive
years without ever reporting a loss?
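The Excel formula quoted above can be reproduced with the Python standard library. The second half of the sketch is only one hypothetical way to frame the question just asked: it assumes that annual log-returns are independent draws from the S&P 500 distribution N(9.7%, 17.0%) and asks how many standard errors separate a 31-year average of 49.0% from the mean. That framing is an assumption of this example, not a calculation made in the paper.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Quantile whose upper-tail probability is 1e-10 (Excel: NORM.S.INV(1-1E-10)).
threshold = std_normal.inv_cdf(1 - 1e-10)
tail_prob = std_normal.cdf(-threshold)  # P(X > threshold), by symmetry

# Hypothetical framing: i.i.d. normal log-returns with S&P 500 parameters.
MU_SP, SIGMA_SP = 0.097, 0.170   # S&P 500 mean and std. dev. (from the text)
MU_MF, YEARS = 0.490, 31         # Medallion average log-return, sample size

std_error = SIGMA_SP / math.sqrt(YEARS)       # std. error of a 31-year average
z = (MU_MF - MU_SP) / std_error               # distance in standard errors
p_value = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail probability of z

print(f"threshold for a 1e-10 tail: {threshold:.4f}")
print(f"tail probability:           {tail_prob:.3e}")
print(f"z-score of the average:     {z:.2f}")
print(f"upper-tail probability:     {p_value:.3e}")
```

The complementary error function `math.erfc` is used for the last step because the tail probability is far too small to be obtained accurately as one minus a cumulative probability.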
In the text that follows we will describe some stages of the long journey made by
human beings to communicate with each other, openly or in secret. We will deal with
Egyptian Hieroglyphs (§3), Julius Caesar’s cipher (§4), Fibonacci’s Liber Abaci (§5), Leonardo
da Vinci’s Mirror Writing (§6), Mary Stuart’s Code (§7), Monte Carlo Descrambling (§8),
Shazam and Google Lens (§9), and Kaggle (§10). Then we will conclude (§11).
3. EGYPTIAN HIEROGLYPHS
3.1 Rosetta Stone
The Rosetta Stone is a granodiorite stele inscribed with three versions of a decree issued in
Memphis, Egypt, in 196 BC during the Ptolemaic dynasty on behalf of King Ptolemy V
Epiphanes. It was discovered in July 1799 by French officer Pierre-François Bouchard during
the Napoleonic campaign in Egypt, while digging the foundations of an addition to a fort in the
Nile Delta, near el-Rashid (Italianized as Rosetta, which means “little rose”). It is an irregularly
shaped grey and pink stone [3 feet 9 inches (114 cm) long and 2 feet 4.5 inches (72 cm)
wide], exhibited in the British Museum since 1802 (Figure 3).
The Rosetta Stone bears a priestly decree concerning Ptolemy V in three blocks of text:
Hieroglyphic (the “language of the gods”, 14 lines), Demotic (the “language of acts”, 32 lines)
and Ancient Greek (53 lines). Translation of the Ancient Greek was relatively easy. Translation
of Demotic (the ancient Egyptian script preceding Coptic) was harder, but the proper names
(Ptolemy, Alexander and Alexandria) were quickly deciphered. Translation of hieroglyphic text
was even harder, but it was soon established that names of kings or pharaohs were contained
within elongated ovals (“cartouches”).
There were six cartouches on the Rosetta Stone. According to the Greek translation,
these cartouches clearly had to contain the name Ptolemy (Ptolemaios, in Greek). Three of them
looked as in Figure 4 and featured the name Ptolemy along with an Egyptian honorific: “Ptole-
my, living for ever, beloved of Ptah” {Robinson [4], p. 125}. The other three looked like this:
Hieroglyph:            𓊪 𓏏 𓍯 𓃭 𓐝 𓇌 𓋴
Champollion’s reading: P T O L M E S
The hieroglyphs of ancient Egypt are one of the earliest forms of writing. They can be read from
right to left or left to right. You can distinguish the direction in which the text is to be read be-
cause the human or animal figures always face towards the beginning of the line. Also the upper
symbols are read before the lower. There are no spaces between words, line breaks or punctua-
tion.
Google created tools and models for the three phases of the algorithm:
1) Extraction - taking hieroglyphic script and sequences from source images and creating
workable facsimiles;
2) Classification - training a neural network to correctly identify over 1000 hieroglyphs;
3) Translation - matching sequences and blocks of text to available dictionaries and published
translations.
Letter(s)   Sign   Name           Unicode
A E O       𓄿     vulture        U+1313F
A           𓂝     forearm        U+1309D
B           𓃀     foot           U+130C0
CH          𓍿     hobble rope    U+1337F
D           𓂧     hand           U+130A7
F PH V      𓆑     horned viper   U+13191
G           𓎼     pot stand      U+133BC
H           𓎛     rope           U+1339B
H           𓉔     shelter        U+13254
I Y         𓇋     reed leaf      U+131CB
J (G)       𓆓     cobra          U+13193
K (C)       𓎡     basket         U+133A1
K (C)       𓏘     hillside       U+133D8
L R         𓂋     open mouth     U+1308B
M           𓅓     owl            U+13153
N           𓈖     water          U+13216
O U W       𓅱     quail chick    U+13171
P           𓊪     stool          U+132AA
S (C)       𓋴     folded cloth   U+132F4
SH          𓈙     lake           U+13219
T           𓏏     bread loaf     U+133CF
TH          𓐍     (unknown)      U+1340D
TH          𓄡     cow’s belly    U+13121
Z           𓊃     door bolt      U+13283
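The hexadecimal values in the table above are Unicode code points, so the signs can be generated programmatically. The sketch below is illustrative only: the sign names and code points are copied from the table, while the one-sign-per-letter choice in `LETTER_TO_SIGN` (for example "rope" rather than "shelter" for H) and the function `spell` are hypothetical simplifications introduced here, not a method described in the paper.

```python
# Code points copied from the table above (Unicode "Egyptian Hieroglyphs" block).
SIGNS = {
    "vulture": 0x1313F, "forearm": 0x1309D, "foot": 0x130C0,
    "hobble rope": 0x1337F, "hand": 0x130A7, "horned viper": 0x13191,
    "pot stand": 0x133BC, "rope": 0x1339B, "shelter": 0x13254,
    "reed leaf": 0x131CB, "cobra": 0x13193, "basket": 0x133A1,
    "hillside": 0x133D8, "open mouth": 0x1308B, "owl": 0x13153,
    "water": 0x13216, "quail chick": 0x13171, "stool": 0x132AA,
    "folded cloth": 0x132F4, "lake": 0x13219, "bread loaf": 0x133CF,
    "cow's belly": 0x13121, "door bolt": 0x13283,
}

# Hypothetical choice of a single sign per Latin letter; where the table
# offers two signs for the same letter, the first one listed is used.
LETTER_TO_SIGN = {
    "A": "vulture", "B": "foot", "C": "basket", "D": "hand", "E": "vulture",
    "F": "horned viper", "G": "pot stand", "H": "rope", "I": "reed leaf",
    "J": "cobra", "K": "basket", "L": "open mouth", "M": "owl", "N": "water",
    "O": "quail chick", "P": "stool", "R": "open mouth", "S": "folded cloth",
    "T": "bread loaf", "U": "quail chick", "V": "horned viper",
    "W": "quail chick", "Y": "reed leaf", "Z": "door bolt",
}


def spell(word):
    """Return the signs for each letter of `word`, skipping unmapped letters."""
    return "".join(
        chr(SIGNS[LETTER_TO_SIGN[letter]])
        for letter in word.upper()
        if letter in LETTER_TO_SIGN
    )


print(spell("Nile"))
```

A letter-by-letter lookup of this kind ignores digraphs such as CH, SH and TH and does not reflect how Egyptian was actually written; it only shows how the code points in the table map to displayable characters.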
A similar cipher was used by the Emperor Augustus (23 September 63 BC - 19 August AD 14):
Figure 5 shows the Alberti cipher disk, an enciphering and deciphering tool developed in 1470
by Leon Battista Alberti. The device consists of two concentric circular plates mounted one on
top of the other.
Source: Wikipedia.
Figure 5 Alberti cipher disk.
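The shift ciphers discussed in this section can be sketched in a few lines. According to Suetonius, Caesar replaced each letter with the one three places further along the alphabet, while Augustus used a shift of one. The example below assumes the modern 26-letter alphabet with wrap-around, which is a simplification of the Latin alphabet actually in use; the function name `shift_cipher` is illustrative.

```python
import string

ALPHABET = string.ascii_uppercase  # assumption: modern 26-letter alphabet


def shift_cipher(text, shift):
    """Shift every letter of `text` by `shift` positions, wrapping around."""
    result = []
    for ch in text.upper():
        if ch in ALPHABET:
            position = (ALPHABET.index(ch) + shift) % len(ALPHABET)
            result.append(ALPHABET[position])
        else:
            result.append(ch)  # spaces and punctuation are left unchanged
    return "".join(result)


ciphertext = shift_cipher("VENI VIDI VICI", 3)   # Caesar: shift of three
plaintext = shift_cipher(ciphertext, -3)         # decrypt with the opposite shift
print(ciphertext)  # YHQL YLGL YLFL
print(plaintext)   # VENI VIDI VICI
```

With only 25 non-trivial shifts, such a cipher can be broken by simply trying every key, which is why later devices such as the Alberti disk changed the substitution alphabet during the message.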
...
Manuscript| Liber Abaci, Leonardo Pisano, 1202 | Codice Magliabechiano, C. I, 2616, Badia Fiorentina, n. 73.
Incipit liber Abaci Compositus a leonardo filio Bonacij Pisano In Anno. M° cc° ij.°
Scripsistis mihi domine mi magister Michael Scotte, summe philosophe, vt librum de numero, quem dudum
composui, uobis transcriberem: vnde uestrae obsecundans postulationi, ipsum subtiliori perscrutans Indagine ad
uestrum honorem et aliorum multorum utilitatem correxi. In cuius correctione quedam necessaria addidj, et quedam
superflua resecaui. In quo plenam numerorum doctrinam edidj, iuxta modum indorum, quem modum in ipsa scientia
prestantiorem elegi.
...
Incipit primum capitulum. Nouem figure indorum he sunt
9 8 7 6 5 4 3 2 1.
Cvm his itaque nouem figuris, et cum hoc signo 0, quod arabice zephirum appellatur, scribitur quilibet numerus, ut
inferius demonstratur.
“Decrypted” Latin | Boncompagni, B. [6], Il Liber Abbaci di Leonardo Pisano | Rome, 1857.
Here begins the Book of Calculation Composed by Leonardo Pisano, Family Bonaci, In the Year 1202.
You, my Master Michael Scott, most great philosopher, wrote to my Lord about the book on numbers which some
time ago I composed and transcribed to you; whence complying with your criticism, your more subtle examining cir-
cumspection, to the honor of you and many others I with advantage corrected this work. In this rectification I added
certain necessities, and I deleted certain superfluities. In it I presented a full instruction on numbers close to the
method of the Indians, whose outstanding method I chose for this science.
...
Here Begins the First Chapter. The nine Indian figures are:
9 8 7 6 5 4 3 2 1.
With these nine figures, and with the sign 0 which the Arabs call zephir any number whatsoever is written, as is
demonstrated below.
English | Sigler, L. E. [7], Fibonacci’s Liber Abaci, pp. 15 and 17, 2002.
flipped to
A B C D E F G H I J K
L M N O P Q R S T U V
Note: The alphabet contains 5 blank characters that can be used as space
or word separator (this makes the frequency analysis more complex):
Source: https://www.dcode.fr/mary-stuart-code.
The statistical distribution of letters in the English language, derived from some classic
books, is reported in Table 5.
Source: Art Owen [11], Stat 362 (Monte Carlo Methods), Stanford University, Fall 2008.
Note: The above statistics are derived from these classics (available at www.literature.org): Origin of
Species (Charles Darwin), The Voyage of the Beagle (Charles Darwin), Jane Eyre (Charlotte Bronte),
Wuthering Heights (Emily Bronte), Tarzan of the Apes (Edgar Rice Burroughs), The Return of Tarzan
(Edgar Rice Burroughs), Paradise Lost (John Milton). All upper-case letters were converted to lower-
case. All numbers and all special characters were removed. A carriage return is treated as a space. There
are exactly 5,086,936 characters in these files.
The Monte Carlo approach also requires the probabilities of adjacent letters, i.e. the (first-order)
letter transition probabilities:
A real-life example of Monte Carlo descrambling is shown by Persi Diaconis {[9], pp. 1-3}:
One day, a psychologist from the state prison system showed up with a collection
of coded messages. ... The problem was to decode these messages. Marc [Coram]
guessed that the code was a simple substitution cipher, each symbol standing for a
letter, number, punctuation mark or space. ... I like this example because a) it is
real, b) there is no question the [Monte Carlo] algorithm found the correct answer,
and c) the procedure works despite the implausible underlying assumptions. In
fact, the message is in a mix of English, Spanish and prison jargon. The plausibility
measure is based on first-order transitions only. A preliminary attempt with single-
letter frequencies failed.
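The procedure described in [9] scores a candidate decoding by the product of the first-order transition probabilities of its consecutive letters, proposes a new key by swapping two symbols, and accepts the proposal with probability min(1, ratio of the two scores). The following is a minimal sketch of that idea. The add-one smoothing, the 27-symbol alphabet, the identity starting key and all function names are assumptions of this example; in practice the transition probabilities would be estimated from a large reference text such as the classics listed above.

```python
import math
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumption: lower-case letters + space


def transition_log_probs(reference):
    """Log first-order transition probabilities, with add-one smoothing."""
    pairs = Counter(zip(reference, reference[1:]))
    firsts = Counter(reference[:-1])
    return {
        (a, b): math.log((pairs[(a, b)] + 1) / (firsts[a] + len(ALPHABET)))
        for a in ALPHABET
        for b in ALPHABET
    }


def log_plausibility(text, log_m):
    """Sum of log transition probabilities over consecutive characters."""
    return sum(log_m[pair] for pair in zip(text, text[1:]))


def decode(cipher, key):
    """Apply a key that maps each cipher symbol to a plaintext character."""
    return "".join(key[c] for c in cipher)


def metropolis(cipher, log_m, steps=2000, seed=0):
    """Random-transposition Metropolis search over substitution keys."""
    rng = random.Random(seed)
    key = {c: c for c in ALPHABET}  # start from the identity guess
    score = log_plausibility(decode(cipher, key), log_m)
    best_key, best_score = dict(key), score
    for _ in range(steps):
        a, b = rng.sample(ALPHABET, 2)
        key[a], key[b] = key[b], key[a]  # propose: swap two symbols
        new_score = log_plausibility(decode(cipher, key), log_m)
        if new_score >= score or rng.random() < math.exp(new_score - score):
            score = new_score  # accept the proposal
            if score > best_score:
                best_key, best_score = dict(key), score
        else:
            key[a], key[b] = key[b], key[a]  # reject: undo the swap
    return best_key, best_score
```

A call such as `metropolis(cipher_text, transition_log_probs(reference_text))` returns the most plausible key visited and its score. With a short reference text or a short message the chain may well settle on a wrong key, so the sketch illustrates the mechanics rather than a reliable decoder.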
Unrecognized:
Marilyn Monroe and Arthur Miller; Romy Schneider and Peter O'Toole; Romy Schneider and Alain Delon; Michel Piccoli and Romy Schneider; Humphrey Bogart and Lauren Bacall; Jane Birkin and Serge Gainsbourg
George Harrison and Pattie Boyd; Jane Asher and Paul McCartney; Mel Ferrer and Audrey Hepburn; Jacqueline Kennedy and John Fitzgerald Kennedy; Jim Morrison and Pamela Courson
Gregory Peck and Ingrid Bergman; Marcello Mastroianni and Anna Karina; Richard Burton and Elizabeth Taylor; Audrey Hepburn and Mel Ferrer; Audrey Hepburn and Mel Ferrer; Gad Elmaleh and Audrey Tautou
Recognized:
Neile Adams and Steve McQueen; Natalie Wood and Robert Redford; Jack Nicholson and Anjelica Huston; Anna Karina and Jean-Luc Godard; Romy Schneider and Alain Delon; Monica Vitti and Michelangelo Antonioni
Marilyn Monroe and Eli Wallach; Alain Delon and Romy Schneider; Alain Delon and Nathalie Delon; Claude Mann and Jeanne Moreau; Cary Grant and Katharine Hepburn
Source: YouTube (https://youtu.be/ck57LnYScNQ), Shazam (Burt Bacharach, Live in Sydney, 2008) and
Google Lens.
10. KAGGLE
A huge amount of market news, available on social media and elsewhere, needs to be processed
to discover whether it will have a positive, negative, or neutral effect on a particular company. A
good starting point for learning Natural Language Processing (NLP) is to participate in a
competition organized by Kaggle, a subsidiary of Google. One of Kaggle’s “Getting Started”
competitions is titled “Natural Language Processing with Disaster Tweets: Predict which Tweets are
about real disasters and which ones are not.” All of the work can be done in Kaggle’s free,
no-setup, Jupyter Notebooks environment, where you can run Python code.
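To make the positive / negative / neutral classification concrete, the sketch below scores a headline by counting words from two small lists. It is a deliberately naive illustration: the word lists, the function name `sentiment` and the sample headlines are hypothetical, and it is not the method used by competition entries or by any fund, which would typically rely on models trained on labelled data.

```python
import re

# Hypothetical, deliberately tiny word lists used only for illustration.
POSITIVE = {"beat", "beats", "growth", "profit", "record", "strong", "upgrade"}
NEGATIVE = {"downgrade", "fraud", "lawsuit", "loss", "miss", "misses", "weak"}


def sentiment(text):
    """Label `text` by the balance of positive and negative words it contains."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for headline in (
    "Acme beats estimates on strong growth",
    "Acme faces lawsuit after quarterly loss",
    "Acme to hold annual meeting in May",
):
    print(f"{sentiment(headline):8s} | {headline}")
```

Word counting of this kind ignores negation, context and sarcasm, which is one reason why trained classifiers are usually preferred for text such as tweets.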
The competition is ongoing. Table 7 shows a few steps of the current leader’s code.
REFERENCES
[1] Hull, J. C., Machine Learning in Business: An Introduction to the World of Data Science, 3rd ed.,
KDP, May 26th, 2021.
[2] Zuckerman, G., The man who solved the market: How Jim Simons launched the quant revolution,
Penguin Random House, 2019.
[3] Rubinstein, M., “The World According to Mark Rubinstein: Interview”, Derivatives Strategy Maga-
zine, July 1999.
[4] Robinson, A., Cracking the Egyptian Code: The Revolutionary Life of Jean-François Champollion,
Thames & Hudson, 2018.
[5] Suetonius, G., The Twelve Caesars: The Lives of the Roman Emperors, J. C. Rolfe (Trans.). St. Pe-
tersburg, FL: Red and Black Publications, 2008.
[6] Boncompagni, B., Il Liber Abbaci di Leonardo Pisano, Rome, 1857.
[7] Leonardo da Pisa, Liber Abaci, 1202. See Sigler, L. E., Fibonacci’s Liber Abaci - A Translation
into Modern English of Leonardo Pisano’s Book of Calculation, Springer, 2002.
[8] Singh, S., The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography,
Fourth Estate, 1999.
[9] Diaconis, P., “The Markov Chain Monte Carlo Revolution,” Bulletin of the American Mathematical
Society, 46(2), 179-205, 2009.
[10] Wu, Y., et al., “Google’s neural machine translation system: Bridging the gap between human and
machine translation,” https://arxiv.org/pdf/1609.08144.pdf (2016).
[11] Owen, A., “Monte Carlo theory, methods and examples,” https://artowen.su.domains/mc/.
[12] Baker, M., and Wurgler, J., “Investor Sentiment and the Cross-Section of Stock Returns”, Journal of
Finance, Vol. 61, No. 4, 2006.
[13] Zhang, W., and Skiena, S., “Trading Strategies to Exploit Blog and News Sentiment”, Proceedings
of the Fourth International AAAI Conference on Weblogs and Social Media, 2010.
[14] John, R., and Govilkar, S., “Information Retrieval Technique for Web Using NLP”, International
Journal on Natural Language Computing, Vol. 6, No. 5, 2017.
AUTHORS
Emilio Barone is a Stephen A. Ross Professor of Financial Economics, Intesa Sanpaolo
Chair, in the Department of Economics and Finance, Luiss Guido Carli University of
Rome. He has been an Executive Director at Intesa Sanpaolo, Head of the Financial
Risks Analysis Department and Head of Research at Istituto Mobiliare Italiano (I.M.I.).
He has previously worked as an economist in the Research Department of the Bank of
Italy. He has followed postgraduate studies at Yale University after receiving his
B.S./M.S. in Economics from “La Sapienza” University of Rome. He is the author of
many articles on derivatives pricing and risk measurement / management.
APPENDIX A
THE MAN WHO SOLVED THE MARKET BY GREGORY ZUCKERMAN
Appendix 1
Year   Net Returns   Management Fee*   Performance Fee   Returns Before Fees   Size of Fund     Medallion Trading Profits**
1988       9.0%           5%                20%                16.3%           $20 million      $3 million
1989      -4.0%           5%                20%                 1.0%           $20 million      $0 million
1990      55.0%           5%                20%                77.8%           $30 million      $23 million
1991      39.4%           5%                20%                54.3%           $42 million      $23 million
1992      33.6%           5%                20%                47.0%           $74 million      $35 million
1993      39.1%           5%                20%                53.9%           $122 million     $66 million
1994      70.7%           5%                20%                93.4%           $276 million     $258 million
1995      38.3%           5%                20%                52.9%           $462 million     $244 million
1996      31.5%           5%                20%                44.4%           $637 million     $283 million
1997      21.2%           5%                20%                31.5%           $829 million     $261 million
1998      41.7%           5%                20%                57.1%           $1.1 billion     $628 million
1999      24.5%           5%                20%                35.6%           $1.54 billion    $549 million
2000      98.5%           5%                20%               128.1%           $1.9 billion     $2,434 million
2001      33.0%           5%                36%                56.6%           $3.8 billion     $2,149 million
2002      25.8%           5%                44%                51.1%           $6.24 billion    $2.676 billion
2003      21.9%           5%                44%                44.1%           $5.09 billion    $2.245 billion
2004      24.9%           5%                44%                49.5%           $5.2 billion     $2.572 billion
2005      29.5%           5%                44%                57.7%           $5.2 billion     $2.999 billion
2006      44.3%           5%                44%                84.1%           $5.2 billion     $4.374 billion
2007      73.7%           5%                44%               136.6%           $6.2 billion     $7.104 billion
2008      82.4%           5%                44%               152.1%           $5.2 billion     $7.911 billion
2009      39.0%           5%                44%                74.6%           $5.2 billion     $3.881 billion
2010      29.4%           5%                44%                57.5%           $10 billion      $5.750 billion
2011      37.0%           5%                44%                71.1%           $10 billion      $7.107 billion
2012      29.0%           5%                44%                56.8%           $10 billion      $5.679 billion
2013      46.9%           5%                44%                88.8%           $10 billion      $8.875 billion
2014      39.2%           5%                44%                75.0%           $9.5 billion     $7.125 billion
2015      36.0%           5%                44%                69.3%           $9.5 billion     $6.582 billion
2016      35.6%           5%                44%                68.6%           $9.5 billion     $6.514 billion
2017      45.0%           5%                44%                85.4%           $10 billion      $8.536 billion
2018      40.0%           5%                44%                76.4%           $10 billion      $7.643 billion
Average net returns: 39.1%
Average returns before fees: 66.1%
Total trading profits: $104,530,000,000
* Fees are charged by the Medallion fund to its investors, which in most years represents the firm's own employees
and former employees.
** Gross returns and Medallion profits are estimates − the actual number could vary slightly depending on when the
annual asset fee is charged, among other things. Medallion's profits are before the fund's various expenses.
Average Annual Returns: 66.1% gross, 39.1% net
The above profits of $104.5 billion represent those of the Medallion fund. Renaissance also profits from three hedge
funds available to outside investors, which managed approximately $55 billion as of April 30, 2019. (Source: Medal-
lion annual reports; investors)
Appendix 2
Returns Comparison
Investor Key Fund/Vehicle Period Annualized Returns*
Jim Simons Medallion Fund 1988-2018 39.1%
George Soros Quantum Fund 1969-2000 32.0%†
Steven Cohen SAC 1992-2003 30.0%
Peter Lynch Magellan Fund 1977-1990 29.0%
Warren Buffett Berkshire Hathaway 1965-2018 20.5%‡
Ray Dalio Pure Alpha 1991-2018 12.0%
* All returns are after fees.
† Returns have fallen in recent years as Soros has stopped investing money for others.
‡ Buffett averaged 62% gains investing his personal money from 1951 to 1957, starting with
less than $10,000, and saw average gains of 24.3% for a partnership managed from 1957 to
1969.
Appendix 2 does not report the performance of Princeton/Newport Partners, the hedge fund of
Edward Thorp. However, Zuckerman mentions it elsewhere {[2], pp. 127-9}:
During the 1970s, Thorp helped lead a hedge fund, Princeton/Newport Partners,
recording strong gains and attracting well-known investors ... by the late 1980s,
Thorp’s fund stood at nearly $300 million, dwarfing the $25 million Simons’s Me-
dallion fund was managing at the time ... Over its nineteen-year existence, the
hedge fund featured annual gains averaging more than 15 percent (after charging
investors various fees), topping the market’s returns over that span.