The basis of all data recording, sharing, and retrieval is language, regardless of the type of data: computer code, war stories told by grandparents, public accounting records. This is no different for OSINT data. Practitioners collect and evaluate human-generated, open-source content written in human language, using tools coded in computer languages and queries formatted to a specific structure.
You get the point. Language is super important.
Leaving aside the content side of language found on the web, we'd like to discuss the other sense of language in the OSINT industry: the aforementioned tools.
OSINT Data-querying and Investigation Tools
As it stands, the current OSINT landscape is filled with tools that allow you to collect and query OSINT data, and some of those also allow for the investigation of that data. All of these tools are predicated on the same framework: OSINT data from data streams is queried and collected by means of some querying language. A common example is Boolean logic, which applies a set of keywords to query against. Even rule-based automation for data collection is still, at its heart, operating on a structure of keyword matching.
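To make that concrete, here is a minimal sketch of the kind of Boolean keyword rule these collection pipelines rest on. The keywords and example posts are hypothetical, not drawn from any real rule set.

```python
import re

# Hypothetical Boolean rule: flag a post containing "protest" AND ("riot" OR "blockade").
MUST_CONTAIN = ["protest"]
ANY_OF = ["riot", "blockade"]

def matches_rule(post: str) -> bool:
    """Return True if the post satisfies the Boolean keyword rule."""
    text = post.lower()
    has_all = all(re.search(rf"\b{re.escape(kw)}\b", text) for kw in MUST_CONTAIN)
    has_any = any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in ANY_OF)
    return has_all and has_any

print(matches_rule("Organizers worry the protest could turn into a riot."))  # True
print(matches_rule("The protest stayed peaceful all afternoon."))            # False
```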
While this system of data collection is effective for the most part, it does present the challenge of false positives.
False Positives
The issue with false positives when using keyword groups stems from the complexity of human language, in all of its variations and applications. Essentially, it boils down to the fact that very little of human communication is concrete and unmoving; there is so much context that shapes the meaning of what we say.
And, for as powerful as we've made the average computer in modern times, it still isn't very good at understanding the nuances of human context within our communications. A computer or program looking for keywords matches exactly, keyword to keyword. The computer generally can't consider what is meant by the word or why it was used; it can only present you with the word you asked for. See the following examples:
“I’m going to bomb some public place.”
“I just got back from the restaurant – that place is the bomb!”
From the perspective of the English language and the context of social communication, those are two wildly different statements that convey two very different sentiments. You and I can see that, based on our experience and understanding of the English language, which ultimately provides us with a context, a lens, to view those statements through.
A computer just sees the word bomb as an exact match for your keyword. That means, in the two statements above, whichever sentiment you're actually querying for, the opposite context will also match and come back as a false positive.
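A few lines of Python, a sketch built around the two example statements above, show how blunt an exact keyword match is:

```python
# Naive exact keyword matching: both statements contain "bomb",
# so both come back, even though only the first conveys a threat.
posts = [
    "I'm going to bomb some public place.",
    "I just got back from the restaurant - that place is the bomb!",
]
hits = [post for post in posts if "bomb" in post.lower()]
print(hits)  # returns both posts; the second is a false positive
```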
As any practitioner in any modern security vertical will tell you, false positives in any form are a time-sink. In the world of OSINT, they mean sifting through a whole lot of posts and content that have a 1:1 match to a keyword you're using but have nothing to do with what you're actually searching for. Sigh.
Machine Learning to the (Semi-)Rescue
What the heck is Natural Language Processing?
Natural Language Processing (NLP), often discussed alongside Natural Language Understanding, is a form of machine learning that, essentially, teaches context to a data-processing program. Its goal is to enable a rough approximation of a human-like understanding of language.
Without getting too far into the technical weeds, this is done via a neural network, a system that can “learn” from vast amounts of real-world data. Given this data, the neural network learns how words are used in the English language and, by extension, how to distinguish between contexts. Once trained, the language model can be applied to specific tasks, such as language classification: deciding the context of a specific selection of text and labelling it as such. At Media Sonar, a language classifier is helping to identify when violent threats are made on publicly available sites, not only through keyword matches but also through relevance to the context of the topic.
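To illustrate the idea (this is a generic sketch, not Media Sonar's classifier), here is what language classification looks like with an off-the-shelf zero-shot pipeline from the Hugging Face transformers library; the candidate labels are assumptions chosen for this example.

```python
from transformers import pipeline

# Generic illustration of language classification, not Media Sonar's model:
# a zero-shot classifier assigns a context label to each post.
classifier = pipeline("zero-shot-classification")

labels = ["violent threat", "casual slang"]  # hypothetical context labels
posts = [
    "I'm going to bomb some public place.",
    "I just got back from the restaurant - that place is the bomb!",
]
for post in posts:
    result = classifier(post, candidate_labels=labels)
    print(post, "->", result["labels"][0], round(result["scores"][0], 2))
# The same keyword "bomb" now lands in two different context buckets.
```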
In the end, NLP and machine learning will allow a computer to make a human-like, intelligent decision about a keyword match, driven, most importantly, by an applicable and contextual understanding of the entire text.
While this will not completely replace traditional keyword searching at this point, it will bolster the efficacy and relevancy of search results.
The next step: A smarter way to analyze the data
So why is this important, beyond the “cool factor” of employing Machine Learning? A big piece of what we strive to do at Media Sonar for our end-users is to provide the most effective results while producing considerable time and energy savings over historically manual tasks.
We believe that NLP and other Machine Learning processes are the next long-term step in that journey.
Language classifiers applied through ML processes have the potential to eliminate a great deal of the false positives that are inherent to traditional keyword searching. Recall that a keyword search pulls back exact hits on a word; the computer is just making exact matches to the words you originally put into the query. Even if a practitioner were to sit down and craft an incredibly complex search query, the results are always going to be filled with false positives that a practitioner would traditionally have to sift through. The only way to combat that is to build more specific terms and statement contexts into the query. But even then, false positives will still exist, resulting in more work, time and energy.
Related to that, the other issue is the vast knowledge of keywords a subject can demand just to write an effective query in the first place. Given the sheer complexity of communication in most languages, with alternate word use, slang and shifting context, a practitioner can end up collecting and maintaining an exorbitantly large library of keywords in their head. Again, that equates to time and energy.
Returning results based on their pattern of use and context, along with a relevancy score, eliminates many erroneous 1:1 matches and removes the need for a practitioner to maintain personal knowledge of every word use; it also lends itself to higher levels of confidence in the data being returned. By returning more accurate results, faster and with less effort, we let security practitioners get back to the other tasks that demand their time and energy.
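Put together, the workflow might look something like the sketch below: keyword matching narrows the stream, a classifier assigns a relevancy score, and only hits above a threshold reach the analyst. The stand-in classifier, scores and threshold here are illustrative assumptions, not our production pipeline.

```python
# Illustrative triage flow; the stand-in classifier and threshold are assumptions.
RELEVANCE_THRESHOLD = 0.8

def classify_threat(post: str) -> float:
    """Stand-in for a real language classifier returning a relevancy score."""
    return 0.95 if "public place" in post else 0.05  # hypothetical scores

posts = [
    "I'm going to bomb some public place.",
    "I just got back from the restaurant - that place is the bomb!",
]
keyword_hits = [p for p in posts if "bomb" in p.lower()]              # 1) keyword match
scored = [(p, classify_threat(p)) for p in keyword_hits]              # 2) classify
relevant = [(p, s) for p, s in scored if s >= RELEVANCE_THRESHOLD]    # 3) filter
print(relevant)  # only the genuine threat remains, with its relevancy score
```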
Be on the lookout for more about our language classification proofs of concept in the future; we'll be working closely with our customers to beta new classifications as we develop them.
We’d like to show you what else we’ve been working on that will help your security team save time and energy performing OSINT investigations. Book a demo with a product specialist.