One keyword matches all variations of a word
More and more, digital investigators are encountering hard disks with text in languages that they do not speak. This makes the challenge of finding and interpreting the relevant data greater than in a typical investigation because the investigator may not know about nuances in word variations, the analysis tool may not know about all file formats and text encodings, and the investigator may not be able to read text that he finds.
Basis Technology’s Odyssey Digital Forensics™ Keyword Search helps to find and interpret the relevant documents by applying advanced linguistic processing techniques. The linguistic techniques allow the investigator to find files that would not be found using typical forensics tools.
Most languages have words with spelling variations that convey the same meaning. In English, “color” and “colour” both refer to the visual attributes of something and “carry” and “carried” both refer to holding something. When keyword searching, the investigator needs to take these variations into account.
When the keywords are in the investigator’s native language, he will be aware of the variations he needs to consider. When the keywords are in another language, however, he may not know which variations exist. For example, Arabic vocalization marks are not always written (لُغَوِيَّة versus لغوية), and the alef maksura character (ى) may be used interchangeably with the yeh character (ي). In Japanese, location names may be spelled differently depending on the region of Japan (三河槇原 versus 三河槙原).
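The two Arabic variations described above can be illustrated with a minimal Python sketch. It is not the Rosette implementation; the function name normalize_arabic and the choice to fold alef maksura into yeh (rather than the reverse) are assumptions made only for illustration.

```python
import re

# Arabic vocalization marks (harakat): fathatan through sukun, U+064B..U+0652.
HARAKAT = re.compile("[\u064B-\u0652]")

ALEF_MAKSURA = "\u0649"
YEH = "\u064A"


def normalize_arabic(text: str) -> str:
    """Return a search-friendly form of Arabic text (illustrative only).

    1. Remove vocalization marks so vocalized and unvocalized spellings match.
    2. Fold alef maksura into yeh so either spelling matches the other.
    """
    without_marks = HARAKAT.sub("", text)
    return without_marks.replace(ALEF_MAKSURA, YEH)


if __name__ == "__main__":
    vocalized = "\u0644\u064F\u063A\u064E\u0648\u0650\u064A\u0651\u064E\u0629"
    plain = "\u0644\u063A\u0648\u064A\u0629"
    print(normalize_arabic(vocalized) == normalize_arabic(plain))
```

Applying the same function to both the indexed text and the search keyword is what lets a single keyword match every variation.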
Odyssey uses the Rosette® Linguistics Platform to preprocess multilingual text with its text normalization functions (see sidebar). Odyssey uses the normalized Arabic, Chinese, Japanese, Korean, Farsi (Persian), and Urdu text to build a search index. Then analysts type in search terms through a simple graphical interface to search this linguistically enhanced index. A single search allows them to find variations of their keyword, including numbers that were written in a different numbering system.
In the example shown, a search for the term 1425 found the Arabic equivalent of that number, which is highlighted.
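Matching a number across numbering systems can be sketched by mapping every Unicode decimal digit to its ASCII equivalent before indexing and before searching. The following Python sketch uses only the standard unicodedata module; the function name normalize_digits is hypothetical, and this is not a description of how Odyssey or Rosette performs the step.

```python
import unicodedata


def normalize_digits(text: str) -> str:
    """Replace any Unicode decimal digit with its ASCII counterpart."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # unicodedata.decimal returns the digit value 0-9 as an int.
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    arabic_indic = "\u0661\u0664\u0662\u0665"  # Arabic-Indic digits for 1425
    print(normalize_digits(arabic_indic))
```

Because both the keyword and the indexed text pass through the same mapping, a search typed with Western digits matches the same number written in Arabic-Indic digits.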
Finding additional files does not help solve the case if the file content cannot be interpreted. Frequently, translators are used to help with this process, but they can be both hard to find and expensive.
Odyssey helps the investigator triage documents by identifying and translating names in a file. The Rosette Entity Extractor is used to identify the names of people, places, and organizations in the file. Next, the Rosette Name Translator is used to translate the names into Latin script. The translation of the name is highlighted and presented to the user. This process allows the investigator to identify the documents that are most relevant and that should be translated first.
The text normalization and name translations are the base features of Odyssey, but other linguistic processing techniques can also be incorporated for custom solutions. The Cross Language Toolkit can be used to translate English keywords into Arabic so the user can find Arabic documents without knowing the Arabic word. The Rosette Name Indexer can be incorporated to highlight names that are in a watch list or to allow the analyst to search for transliterated name variations.
The Rosette Chinese Script Converter can be incorporated to convert between Traditional and Simplified Chinese text so that a search for a Simplified Chinese keyword would also find Traditional Chinese documents. Odyssey could also convert between the Japanese kanji, hiragana, and katakana scripts.
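Of the conversions mentioned above, only the hiragana to katakana direction is simple enough to sketch without a dictionary, because the two syllabaries occupy parallel Unicode ranges separated by a fixed offset. The Python sketch below shows that one case. Conversion involving kanji, or between Traditional and Simplified Chinese, needs mapping tables and context, which is not attempted here. The function name is hypothetical.

```python
# Hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 are laid out in the
# same order, so adding 0x60 to a hiragana code point gives its katakana.
HIRAGANA_START = 0x3041
HIRAGANA_END = 0x3096
KANA_OFFSET = 0x60


def hiragana_to_katakana(text: str) -> str:
    """Fold hiragana into katakana; all other characters pass through."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


if __name__ == "__main__":
    print(hiragana_to_katakana("\u3072\u3089\u304C\u306A"))
```

Folding both the index and the keyword to one syllabary lets a search written in either kana script match the other.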
Odyssey analyzes both disk images and local files. When processing a disk image, it can use the NIST National Software Reference Library (NSRL) hash database to ignore known files. This saves indexing time and reduces false positives.
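The known-file filtering step can be sketched as a hash lookup. The Python example below assumes the NSRL hashes have already been loaded into a set of uppercase SHA-1 hex strings; parsing the NSRL distribution itself is not shown, and the function names are hypothetical rather than part of Odyssey.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the uppercase SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def filter_unknown_files(paths, known_hashes):
    """Keep only files whose SHA-1 is absent from the known-file set.

    known_hashes is assumed to be a set of uppercase SHA-1 hex strings.
    """
    return [p for p in paths if sha1_of_file(p) not in known_hashes]
```

Only the files returned by filter_unknown_files would be passed on to text extraction and indexing, which is where the time savings come from.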
Disk images can be stored in either raw or Expert Witness (E01) formats. Odyssey uses The Sleuth Kit to process NTFS, FAT, Ext3, and UFS file systems and recover deleted files. Odyssey uses a custom algorithm to analyze binary data to locate Unicode text in different languages.
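Odyssey's algorithm for locating Unicode text in binary data is not described here. As a rough illustration of the general idea only, the Python sketch below scans raw bytes for runs of UTF-16LE code units in the Arabic block (U+0600 to U+06FF), similar in spirit to a "strings" utility. The minimum run length of four characters and the restriction to one encoding and one script are assumptions chosen to keep the example short.

```python
import re

# In UTF-16LE, a character in U+0600..U+06FF is stored as one arbitrary
# low byte followed by the high byte 0x06. Require at least 4 in a row.
ARABIC_UTF16LE_RUN = re.compile(rb"(?:[\x00-\xFF]\x06){4,}")


def find_arabic_utf16le(data: bytes):
    """Return (byte offset, decoded text) for each candidate Arabic run."""
    return [
        (match.start(), match.group().decode("utf-16-le"))
        for match in ARABIC_UTF16LE_RUN.finditer(data)
    ]


if __name__ == "__main__":
    word = "\u0644\u063A\u0648\u064A\u0629"
    blob = b"\x00\xff\x10" + word.encode("utf-16-le") + b"\x00\x00\x01"
    for offset, text in find_arabic_utf16le(blob):
        print(offset, text)
```

A byte pattern this simple will produce false positives on arbitrary binary data, so a practical tool would need further checks, such as verifying that the decoded characters are assigned letters and covering other encodings and scripts.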
