How to Search Unstructured Text Spatially

Spatial Search and Unstructured Data

How often have you searched for a document using a place name that you “remember” being in it? You can recall that the document talked about California (for instance), but you just can’t find it. Why does this matter? Most office-based jobs require us to find information in documents that we have already gathered. Some of that data may already be indexed in a neat, searchable database, but it’s not always that simple. Sometimes it sits in unstructured text: a news article, a company magazine or a report. Searching such documents is difficult unless they are in electronic form and all in the same place.

Even if the text has been indexed, finding a document with a reference to California can still be difficult. What happens if the document doesn’t contain the word California? Perhaps you remembered incorrectly. Maybe it referred to a specific place like Los Angeles or San Diego, but you only remembered it as California.

The Challenge

Given 1,000 documents, how can you find all those documents that contain references to California? Well, Symilarity’s Helix Insight Engine can do this for you at scale.

1. Ingest Using Natural Language

Symilarity’s Helix search engine can ingest large quantities of documents containing unstructured text and create a searchable index. During ingestion, it can use Natural Language Processing (a type of machine learning) to decompose each sentence and extract key data, such as proper nouns and verbs. Proper nouns (often referred to as named entities in NLP) include the names of people, businesses, or places.

This extracted data is placed in separate fields and also indexed to make it easier to find when searching.
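To make the idea of "separate fields" concrete, here is a minimal sketch in Python. It is not Helix's implementation: a small hypothetical gazetteer (KNOWN_PLACES) stands in for a trained NLP model, and the IndexedDocument record and function names are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical gazetteer. A real pipeline would use a trained NLP model
# to recognise place names it has never seen listed anywhere.
KNOWN_PLACES = ["Los Angeles", "San Diego", "New York", "London", "California"]


@dataclass
class IndexedDocument:
    """One index record: the raw text plus a separate field of place names."""
    doc_id: str
    text: str
    places: list[str] = field(default_factory=list)


def extract_places(text: str) -> list[str]:
    """Return the known place names found in text, in order of first appearance."""
    found = []
    for name in KNOWN_PLACES:
        match = re.search(r"\b" + re.escape(name) + r"\b", text)
        if match:
            found.append((match.start(), name))
    return [name for _, name in sorted(found)]


def ingest(doc_id: str, text: str) -> IndexedDocument:
    """Build an index record with the extracted place names stored alongside the text."""
    return IndexedDocument(doc_id, text, extract_places(text))


doc = ingest("d1", "The team flew from London to San Diego last week.")
print(doc.places)
```

The point of the sketch is the shape of the record: the place names live in their own field, so later steps can work on them without re-reading the full text.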

Extracting information from unstructured text
Using Natural Language Processing to extract place names from unstructured text

2. Geocoding

Geocoding is the process of translating a place name into a geospatial position. For most of us, a geospatial position means latitude and longitude, although there are other ways of stating it. Helix takes the place names extracted during ingestion and uses an external geocoding service (Bing Maps or Google Maps) to determine their latitude and longitude. It then stores those coordinates in the index too. This means that every place name in every document has a set of coordinates that can be represented as a pin on a map. A document containing references to London and New York would have two pins, one in London and one in New York, both pointing to the same document.

If the documents mentioned California (or London or New York) by name, a plain text search would find them anyway. However, if the entries had been Hollywood, Chelsea or Manhattan, it would not have been so straightforward. To find all documents that reference California, London or New York by text alone, we would have to list every potential place name within those locations, which is completely impractical. Geocoding the extracted place names solves this problem.
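The geocoding step can be sketched as follows. This is an illustration under stated assumptions, not Helix's code: the lookup table SAMPLE_COORDINATES is a hypothetical stand-in for an external geocoding service, its coordinates are approximate, and the function names and pin layout are invented for the example.

```python
from typing import Callable, Optional

Coordinate = tuple[float, float]  # (latitude, longitude) in decimal degrees

# Approximate coordinates, hard-coded as a stand-in for an external
# geocoding service. A real system would call that service instead.
SAMPLE_COORDINATES: dict[str, Coordinate] = {
    "London": (51.5074, -0.1278),
    "New York": (40.7128, -74.0060),
    "Hollywood": (34.0928, -118.3287),
    "Manhattan": (40.7831, -73.9712),
}


def lookup_geocoder(place: str) -> Optional[Coordinate]:
    """Hypothetical geocoder: return coordinates, or None if the place is unknown."""
    return SAMPLE_COORDINATES.get(place)


def geocode_document(
    doc_id: str,
    places: list[str],
    geocoder: Callable[[str], Optional[Coordinate]] = lookup_geocoder,
) -> list[dict]:
    """Turn each extracted place name into a pin that points back to its document."""
    pins = []
    for place in places:
        coords = geocoder(place)
        if coords is None:
            continue  # skip names the geocoder cannot resolve
        lat, lon = coords
        pins.append({"doc_id": doc_id, "place": place, "lat": lat, "lon": lon})
    return pins


pins = geocode_document("report-7", ["London", "New York", "Atlantis"])
for pin in pins:
    print(pin)
```

Passing the geocoder in as a parameter is a design choice for the sketch: it keeps the pin-building logic independent of whichever service supplies the coordinates.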

3. Spatial Search

Spatial Search combines two ideas: an ordinary search, and a spatial filter that restricts the list of results to those within a certain geographic area. The area might be described by a boundary (e.g. the state boundary of California) or, more simply, by drawing a rectangle on a map and showing only the results falling within it. Helix enables all search results to be filtered by drawing a rectangular box around the area of interest (a so-called Bounding Box). The results are then limited to those documents that refer to a location within the bounding box.
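A bounding-box filter is simple enough to sketch in a few lines of Python. The code below is an assumption-laden illustration rather than Helix's implementation: the BoundingBox class, the pin layout, the sample pins and the rough box around California are all invented for the example, and the coordinates are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """A rectangle on the map, in decimal degrees."""
    south: float
    north: float
    west: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        # Simple comparison; does not handle boxes that cross the 180° meridian.
        return self.south <= lat <= self.north and self.west <= lon <= self.east


def search_within(pins: list[dict], box: BoundingBox) -> list[str]:
    """Return IDs of documents with at least one pin inside the box, in first-seen order."""
    doc_ids = []
    for pin in pins:
        if box.contains(pin["lat"], pin["lon"]) and pin["doc_id"] not in doc_ids:
            doc_ids.append(pin["doc_id"])
    return doc_ids


# Hypothetical pins, one per geocoded place name (approximate coordinates).
PINS = [
    {"doc_id": "doc-a", "place": "Hollywood", "lat": 34.0928, "lon": -118.3287},
    {"doc_id": "doc-a", "place": "London", "lat": 51.5074, "lon": -0.1278},
    {"doc_id": "doc-b", "place": "San Diego", "lat": 32.7157, "lon": -117.1611},
    {"doc_id": "doc-c", "place": "Manhattan", "lat": 40.7831, "lon": -73.9712},
]

# A rough rectangle drawn around California.
CALIFORNIA_BOX = BoundingBox(south=32.5, north=42.0, west=-124.5, east=-114.1)

print(search_within(PINS, CALIFORNIA_BOX))
```

Neither matching document needs to contain the word California: doc-a is found through Hollywood and doc-b through San Diego. A rectangle is only an approximation of a real boundary, so a box drawn around California will also catch places just outside the state line.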

Extracted Place Names from a document, geocoded and visualised on a map, enabling them to be searched spatially

The Solution

Helix ingests large quantities of documents containing unstructured text. It uses natural language processing (a form of machine learning) to identify and extract place names. During ingestion, an external geocoding service adds spatial coordinates to the extracted locations. This enables documents to be shown on a map, based upon the locations they mention. Helix provides spatial search functionality, enabling users to find only those results that fall within a rectangular area drawn on the map. This means a document can still be found, even if you cannot remember the precise place name.