Members of our staff have deep expertise in multilingual information processing. We have collected our published articles and conference presentations on a wide variety of topics and made them available below.
Basis Products | Chinese Language Issues | Data Quality | Digital Forensics | Entity Extraction | Middle Eastern Language Issues | Name Resolution | Unicode
Basis Technology Products
Building Applications with Rosette Name Indexer and Rosette Name Translator
This presentation will demonstrate how to rapidly construct an application that extracts names from foreign language documents, indexes those names, and automatically generates a high-quality translation into English according to the applicable agency transliteration standard. Real-world examples are presented in Arabic, Chinese, Korean, Pashto, Persian, and Russian, for a total of six scripts and nine languages.
Benson Margulies' presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Tutorial: Arabic Editor and GeoScope®
This tutorial offers hands-on training with Basis Technology’s Arabic Desktop Suite, an integrated collection of productivity-boosting applications designed for analysts, linguists, and translators.
Tina Lieu and Youssef Fayed: Hands-on Tutorial at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Tutorial: Transliteration Assistant & Knowledge Center
This tutorial teaches you to prepare reports using standardized transliterations, to automatically translate lists of names, and to exploit online reference materials.
Tina Lieu and Youssef Fayed: Hands-on Tutorial at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Basis Technology’s Just Right Solutions: Bigger than a Component, Smaller than a Stovepipe
CTO Benson Margulies speaks about how Basis Technology is moving to solve larger pieces of the problems facing government users by delivering modules of functionality — entity extraction, entity translation, name matching, and geospatial fusion — either pre-assembled into desktop applications or as enterprise software.
Benson Margulies’ presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Different Script, Same Name: Tools for Matching and Translation
This presentation explains how to build multilingual name search and translation capabilities into your application by leveraging innovative products.
David Murgatroyd’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Multilingual Deep Web Search
This presentation introduces BrightPlanet’s product, technology, and unique placement in the ‘deep web’ space, and presents the close relationship with Basis Technology and its Rosette® Linguistics Platform.
Duncan Witte and Dirk Koechner’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Adding Linguistics to a Lucene-based Application
This presentation surveys the challenges of integrating complex linguistics into this popular open-source application, along with solutions to them.
Chris Milner, Ph.D., and Steve Cohen’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Building Applications with the Rosette® Linguistics Platform
The presentation reviews the capabilities of RLP, the applications for which it can be used, and the techniques it employs. It also focuses on how RLP can be integrated and used in existing systems, and how it can be tuned for each system’s requirements.
Steve Cohen’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
What Language is That? Using the Rosette® Language Identifier
This presentation gives an overview of the Rosette Language Identifier (RLI) and the techniques RLI uses to automatically identify the language and encoding of a block of text. It also explains how language and encoding identification is an essential stage in the process of working with unstructured multilingual text.
Nobuo Otsuka’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Introduction to Basis Technology Transliteration Assistant
This presentation showcases Basis Technology’s Transliteration Assistant (XA), a Microsoft Word and Excel plug-in which enables translators to quickly and consistently produce accurate transliterations of Arabic names.
Melissa Lucius’ presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Introduction to Basis Technology’s Arabic Editor
This presentation provides a broad overview of Basis Technology’s Arabic Editor and Linguist’s Workbench, a powerful and flexible text editing and analysis system. Arabic Editor is best known for providing a simple method for entering and editing Arabic text using a standard “QWERTY” keyboard.
Mary Galvin’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Chinese Language Issues
Processing the Mosaic of Chinese Dialects
This presentation explores the taxonomy of modern Chinese and illustrates the difficulties of processing its varieties through two case studies: a dialect, Wu Chinese (spoken in the Shanghai area), and a Mandarin variant, Sichuanese (as spoken in Chengdu, the capital of Sichuan province).
Benjamin Swanson's presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
The Web as a Corpus for Chinese Natural Language Processing
This presentation discusses how Basis Technology made this process work, the problems Basis overcame (or avoided), and how it all turned out, both as a problem of Chinese linguistics and as a challenge of downloading, filtering, and processing terabytes of raw web pages from the Internet.
John O’Neil, Ph.D.’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Chinese Language Analysis: Solving the Chinese Puzzle
This presentation surveys the problems associated with automatic processing of Chinese. It reviews the various Chinese character sets and encoding systems; input methods and transliteration; and the solutions offered by Basis Technology’s Chinese Language Analyzer and Named Entity Extractor.
Joe Ho’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Large Corpus Construction for Chinese Lexicon Development
The World Wide Web provides an important source of natural language data in many languages. However, it doesn't include annotation about linguistic structure, so it's necessary to use very large corpora to infer it. We developed a system for continuous, automatic acquisition of a Chinese lexicon. An up-to-date lexicon is needed for many applications, but Chinese is written without spaces between words, so determining word boundaries is the primary problem. We discuss our experience with using the Chinese Web for lexicon construction, focusing on both low-level details and problems we experienced during our initial proof-of-concept experiments, and on algorithmic issues.
Thomas Emerson's presentation from the 29th Internationalization & Unicode Conference, San Francisco, CA, March 7-9, 2006.
Simple and Complex Chinese Scripts: History and Integration to Unicode
Thomas Emerson's presentation from the 24th International Unicode Conference, Atlanta, GA, September 2003.
Data Quality
Exploiting GeoNames in Practical Applications
This presentation explains how NGA’s data is presently exploited by the Arabic Desktop Suite and outlines future directions.
Tina Lieu’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
N-Gram vs. Morphological Analysis - Whitepaper
There are two common ways to segment words: N-Gram and Morphological Analysis. Learn the differences between the two by reading this short whitepaper.
Written by Steven Cohen, VP of Products at Basis Technology - June 29, 2006.
Designing Large-Scale Multilingual Systems
Foreign language documents pose challenges for the entire document-management pipeline: identifying the format, extracting text, indexing, search, retrieval, and display. While commonly used technologies work much better than they did a few years ago, there are still many ways to build systems that fail to handle foreign text. This presentation provides an overview of the problem and points out some of the more important issues and traps.
Benson Margulies’ presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Digital Forensics
Digital Forensics R&D Initiatives at Basis Technology
As criminal and counter-terror investigations cross national and language boundaries, the challenges include not only finding the right documents and evidence among terabytes of data spread across thousands of hard drives, but also searching for keywords or names in different languages, and then interpreting search results in languages unfamiliar to the investigator.
This presentation reviews Basis Technology's digital forensics initiatives and how they connect to the company's broader text analytics and name matching solutions.
Brian Carrier's presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Multilingual Keyword Search Comes to Digital Forensics
Searching hard drives containing text in foreign languages presents technical complexities of which most investigators are unaware: multiple encoding schemes, orthographic variations, spelling variations, and online “chat” dialects. This presentation introduces the Odyssey Digital Forensics system, which has been specifically designed to address these linguistic issues.
Brian Carrier’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Cross Drive Analysis: A New Approach to Media Exploitation
This presentation describes correlation techniques for the analysis of large volumes of digital data, and presents results from ten years of research on real-world drives.
Simson Garfinkel’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Crash Course in Digital Forensics
This presentation provides an overview of key topics in digital forensics, including the investigation process; analysis techniques and tools; and some examples. It also provides information on new forensics products being developed at Basis Technology and how linguistic analysis techniques will be incorporated into these products.
Brian Carrier’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Entity Extraction
Demystifying Entity Extraction Quality
This presentation surveys the types of measurements used for entity extraction quality, and discusses techniques to better extract the data you're looking for when general language models don't fit your needs.
Charlotte Shabarekh’s presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Middle Eastern Language Issues
Next Generation of Arabic Search: Linguistically Intelligent Retrieval
This presentation demonstrates how a search engine with knowledge of the linguistic components of Arabic – the roots, lemmas and stems – can greatly boost the relevancy of search results.
Zina Saadi’s presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
A Linguistic Profile of the Persian Language and Dialects
This presentation gives a brief history of the Persian language, its speakers, and its dialects. It compares Persian to other Arabic script languages such as Arabic, Pashto, and Urdu. It also delves into linguistic aspects of the language that are important to natural language processing and analysis applications, such as orthography, typography rules, phonology, and spelling variants.
Bushra Zawaydeh’s presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
A Profile of Arabic Script Languages
This presentation explores the history of the script in various Arabic script languages, the structure and characteristics of the Arabic alphabet, the alphabet each language uses, their phonological structure and borrowings, and the differences between Arabic and these languages.
Bushra Zawaydeh’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Arabic, Farsi and Urdu Text Normalization for Natural Language Processing
This presentation suggests a multi-level normalization for handling various Arabic script orthographic variations that appear in current news corpora.
Zina Saadi’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Decoding Arabic Chat
This presentation decodes the representation of Arabic sounds in the Romanized shorthand commonly used in chatrooms and blogs by presenting findings from field analyses of Egyptian, Gulf, Iraqi, and Levantine online dialects.
Bushra Zawaydeh’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
What’s in a Persian Name?
This presentation begins with the basics of Persian phonology and name morphology, and delves into the rich influences of other languages; cultural naming preferences (such as the decline of Arabic-based names after the fall of the Shah in Iran); historical roots; and regional customs.
Zina Saadi’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Orthographic Variations in Arabic Corpora
This presentation discusses the different kinds of Arabic orthographic issues that Basis Technology’s Arabic linguists have encountered and handled while building various software solutions for Arabic text analysis.
Bushra Zawaydeh’s presentation at Basis Technology's Government Users Conference in Washington, D.C. on June 14, 2006.
Behind the Name: Etymology of Arabic Names
This presentation gives some samples of various linguistic rules that contributed to the evolution of certain famous Arabic names. It samples different types of names as well as the influence of various foreign languages; regional and social impacts; and language evolution.
Zina Saadi’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Tailoring UAX #29 Word Breaking for Arabic Text
Thomas Emerson's presentation at the 28th Internationalization & Unicode Conference in Orlando, FL on Sept. 8, 2005.
Name Resolution
Everything You've Ever Wanted to do With Names
This presentation will explore challenges of multilingual name resolution, retrieval, and translation. It will also demonstrate Basis Technology’s products which enable rapid identification of names in multiple languages and automatic, high-accuracy translation of those names into English.
David Murgatroyd’s presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Linguistic Considerations of Identity Resolution
This presentation will consider metrics and data for evaluating identity resolution and retrieval systems. It also explores the linguistic challenges these systems face.
David Murgatroyd’s presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Unicode
Unicode 5.0 Essentials
This presentation begins with a look at how Unicode, established in 1991, has changed the way computers process text, with particular emphasis on Arabic, Chinese, Japanese, and Korean. For the non-programmer, this presentation briefly presents foundational concepts of encodings, characters, glyphs, code points, and the design principles behind Unicode.
Tina Lieu’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Hewlett Packard Breaks the Printer Barrier of Global Operations:
Basis Technology reviewed HP’s International Print Solution. Hewlett-Packard introduced technology to help companies overcome a key barrier to global operations — how to print documents correctly everywhere despite differences in language and script. Read our review.
Written by: Benson Margulies, CTO, Basis Technology
Understanding Unicode 5.0
This presentation provides a gentle introduction to the basic concepts of the Unicode 5.0 standard, including characters, encodings, transcoding, byte ordering, and the common UTF-8 and UTF-16 transformation formats. Also covered is practical information about support for Unicode in popular operating systems, computer languages, and protocols.
Ken Glidden’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Big Dots, Little Dots, and Circled Dots: How Unicode can help (and hurt) the process of converting documents to information.
Basis Technology CTO Benson Margulies’ keynote address from the 25th International Unicode Conference, Washington, D.C., March 2004.
Finite State Automata in Unicode, Take 2
Thomas Emerson's presentation from the 24th International Unicode Conference, Atlanta, GA, September 2003.




