Lecture 6


Ling 4807: Applications of Computer in Linguistics
Farig Sadeque
Assistant Professor
Computer Science and Engineering
BRAC University
Development of Bangla language technology: scope and necessity
Talking points

Summary of the current status


Components
Spell and grammar checker
Translation
OCR
Sentiment analysis
Speech to text and text to speech
Plagiarism checker
Question answering
Digital assistant
Sign language to text converter
Summary
Summary
Existing Bangla NLP market analysis

Market opportunity
Existing tech
Entry barrier and challenges
Spell and grammar checker

Market opportunity
Existing tech

One spell checker from EBLICT: https://spell.bangla.gov.bd/


Some other online spell checkers
No grammar check/correction tool
Some SOTA research came out of the ভাষাভ্রম competition this year
Translation
English to Bangla and Bangla to English
Potential Market
75% of consumers are more likely to buy products from websites in their native language
65% of non-native English speakers prefer content in their native tongue
Was valued at USD 650 million in 2020 and is expected to reach USD 3 billion by 2027
Interested communities:
Technology & manufacturing: Translate manuals of different machineries and different products
Global business people: Translate to understand cultural statements better
Finance and legal: translate documents without any contextual mistakes
Marketing (copy & content writers): Translate from Bangla to English or English to Bangla to
advertise products
E-commerce: Translate to communicate product information
Healthcare: Translate important healthcare information
Freelance writers
Existing tech

Multiple government initiatives


Amar Vasha was supposed to use artificial intelligence to translate Supreme Court orders and decisions
from English to Bangla
BUET CSE published a 2.75 million sentence-pair translation corpus
Google's proprietary machine translation technology, dubbed Google Neural Machine
Translation (GNMT), employs recurrent neural networks
Over 4,000 volunteers from 81 locations throughout the nation entered at least
400,000 words into the translation software on a single day to celebrate Independence
Day
Entry barriers

Bangla language structure


The collected corpus was never used to build proper software
Why?
OCR

Potential market: globally valued at 10.65 billion USD


Existing tech

Bangla OCR has been studied since the 1980s


BOCRA and Apona Pathak were introduced
These weren’t open source and weren’t maintained
CRBLP OCR, 2007
Tesseract project
Open source, maintained by Google
Google Lens works moderately well for OCR as well
Puthi was developed by TeamEngine, with 95% claimed accuracy
But the project failed for technical reasons and was never released for public use
Apurba developed one which was funded by EBLICT
Let’s see how well it works, shall we?
https://ocr.bangla.gov.bd/
Entry barriers

Developing a completely new dataset for Bangla is difficult. Why?


Bangla is written in an alphasyllabary script with a cursive writing style and frequent diacritics, so
segmenting graphical components into individual characters becomes very challenging.
Pixels in the upper or lower portion of a character in a complicated script like Bangla cannot simply be
removed during noise elimination, because doing so would erase not just noise but also the difference
between two characters.
The way Bangla words are lettered can also make segmentation difficult.
Complex typefaces, issues with document preservation, etc.
No pipeline was developed
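The noise-removal problem above can be illustrated with a toy sketch. It assumes a tiny hand-drawn binary "glyph" (hypothetical, not real Bangla font data) in which a one-pixel dot below the body is the only thing that distinguishes two letters, much as র is ব with a dot underneath. A simple despeckle filter that drops small connected components removes the stray noise pixel, but it removes the distinguishing dot too.

```python
def components(grid):
    """Return the 4-connected components of '#' pixels as lists of (row, col)."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    comps = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "#" or (r, c) in seen:
                continue
            stack, comp = [(r, c)], []
            seen.add((r, c))
            while stack:  # iterative flood fill
                y, x = stack.pop()
                comp.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == "#" and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            comps.append(comp)
    return comps


def despeckle(grid, min_size):
    """Erase every connected component with fewer than min_size pixels."""
    keep = {p for comp in components(grid) if len(comp) >= min_size for p in comp}
    return [
        "".join("#" if (r, c) in keep else "." for c in range(len(row)))
        for r, row in enumerate(grid)
    ]


# Hypothetical toy glyph: a body, one noise speck (row 3, col 0),
# and a meaningful dot below the body (row 5, col 2).
NOISY_GLYPH = [
    "#####",
    "....#",
    "..###",
    "#...#",
    ".....",
    "..#..",
]

cleaned = despeckle(NOISY_GLYPH, min_size=2)
print("\n".join(cleaned))
```

In the cleaned output the last row is empty: the filter cannot tell the meaningful dot from the speck by size alone, which is why Bangla OCR needs script-aware preprocessing.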
Sentiment Analysis

Potential market: The Asia-Pacific market is expected to reach US$523.6 million by
2027, led by nations such as Australia, India, and South Korea.
Existing tech

One publicly available app:


https://sentiment.bangla.gov.bd/sentiment-emotion-analysis
Lots of researchers and students work on sentiment analysis, but there is still no
publicly available corpus
Entry barriers

Lack of quality data, no standard corpus


A lot of researchers are willing to work on the problem because it’s trendy, not
because they actually want to develop software that can analyze emotions
Social media data has issues
Speech-to-text

Potential market: was valued at USD 1 Billion in 2019 and is expected to grow to
USD 3 Billion by 2027
Existing tech

Some major datasets exist, but no usable model


Not enough data
Lacks variety
Needs three major components:
Acoustic model
Pronunciation model
Language model
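A rough sketch of how the three components listed above could work together when choosing a word. Everything here is a toy assumption: the two romanised words, their phoneme labels, and all probabilities are hypothetical, and a real recognizer scores whole sequences rather than single words.

```python
import math

# Acoustic model (hypothetical scores): log P(audio | phoneme sequence)
ACOUSTIC_LOGP = {
    ("bh", "a", "l", "o"): math.log(0.6),
    ("k", "a", "l", "o"): math.log(0.4),
}

# Pronunciation model (hypothetical lexicon): word -> phoneme sequence
LEXICON = {
    "bhalo": ("bh", "a", "l", "o"),
    "kalo": ("k", "a", "l", "o"),
}

# Language model (hypothetical): log P(word) in the current context
LM_LOGP = {
    "bhalo": math.log(0.3),
    "kalo": math.log(0.7),
}


def score(word):
    """Combined log score: acoustic evidence plus language-model prior."""
    return ACOUSTIC_LOGP[LEXICON[word]] + LM_LOGP[word]


def decode():
    """Pick the word with the highest combined score."""
    return max(LEXICON, key=score)


print(decode())
```

With these toy numbers the acoustic model alone prefers "bhalo", but the language model tips the combined decision to "kalo". All three models have to be trained on Bangla data, which is where the data requirements come from.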
Entry barriers

Data acquisition
Need 10k+ hours of speech data
No datasets previously mentioned had more than 500 hours
Text-to-speech

Potential market: The worldwide Text-to-Speech market is expected to reach USD 5,790.1
million by 2028, up from USD 2,543.1 million in 2021, at a 12.3 percent CAGR
between 2022 and 2028
Existing tech

Kotha, based on Festival, was released in 2007 by CRBLP


Other systems include Subachan and Anuprash
Entry barrier

Lack of publicly accessible gold standard data


Difficult to compare models
Long term sustainability is an issue
Kotha is still available online, but no one has maintained it in the last 10 years, and it still needs
Windows 7 to run
Speech synthesis by its nature is a difficult task
Plagiarism checker

The global market for anti-plagiarism software in the education sector is expected to
increase at a CAGR of 13.8 percent between 2020 and 2027, from USD 819.5 million
in 2020 to USD 2,029.4 million in 2027
Due to the lack of a national plagiarism policy, institutions are sometimes unable to
take action against plagiarized research. No university in Bangladesh even has a
plagiarism policy
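As a sanity check on the anti-plagiarism market figures quoted above, the compound annual growth rate (CAGR) can be recomputed from the start and end values. This is a minimal sketch; the function name is our own, and it assumes the standard CAGR formula with seven yearly periods between 2020 and 2027.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the slide, in USD million.
rate = cagr(819.5, 2029.4, years=2027 - 2020)
print(f"{rate:.1%}")
```

The result rounds to the 13.8 percent stated on the slide, so the three numbers are mutually consistent.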
Existing tech

No foolproof distinct tech exists at this moment


A couple of old efforts are there: one tried to detect plagiarism from NCTB books
Entry barriers

Lack of plagiarism policy


Extensive data is required
Document similarity techniques are not new, but what are we going to compare a document against?
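To make the point about document similarity concrete, here is a minimal sketch of one long-standing technique: Jaccard overlap of character n-grams. The function names and the two sample sentences are our own illustration, not an existing Bangla plagiarism tool. It works on Unicode code points, so Bangla text needs no special handling in Python.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams after collapsing runs of whitespace."""
    text = " ".join(text.split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(doc_a, doc_b, n=3):
    """Shared n-grams divided by all n-grams; 0.0 if both sets are empty."""
    grams_a, grams_b = char_ngrams(doc_a, n), char_ngrams(doc_b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


# Hypothetical example sentences ("I eat rice" / "I do not eat rice").
submitted = "আমি ভাত খাই"
reference = "আমি ভাত খাই না"
print(jaccard_similarity(submitted, reference))
```

The scoring itself is a few lines; the hard part the slide points to is the `reference` side, since a checker is only as good as the collection of Bangla documents it can compare against.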
