Text Analytics and Data Discovery
Research and investigation is undertaken across public and private organisations in all vertical sectors. Whether undertaken by law enforcement, product development, marketing or finance, research and Investigation tasks are often characterised by the need to process large data volumes from multiple sources. This data might be a diverse range of content including:
- user feedback (emails/tweets)
- academic research papers
- newspaper articles
- press releases
- witness statements
- evidence submissions
- case histories
A common factor across all these data sources, is they are expected to contain valuable information and yet most of the data is likely to be an unstructured format. Therefore, a range of tools and techniques (sometimes called text analytics or text mining) are required to analyse, discover, mine and represent the important information within the documents.
How can Symilarity’s Enterprise Search/Insight Engine help?
Helix, Symilarity’s Enterprise Search/Insight Engine provides key text analytics functionality to assist in the research and investigation task.
Ingestion and Indexing
Helix ingests most common file formats and creates an index for the data that is searchable and supports natural language processing. Indexes are stored in repositories, enabling datasets to be segmented.
Helix provides full search capability including boolean operators, fuzzy matching, proximity measures and boosted terms. Helix can search across repositories concurrently (federated search).
Named Entity Extraction
Helix provides natural language processing functionality and will extract references to people, places and organisations.
Helix will identify named entities and the relationship between them. This information can be extracted into a graph database to visualise the connectedness of the data and for further analysis.
Helix uses Vector Space Modelling to cluster documents into a user defined number of clusters. Clusters can be explored manually to determine the commonalities within them.
Helix uses IBM’s Watson Natural Language Understanding (NLU) to enrich the document with a sentiment score.
Helix provides unstructured machine learning using Vector Space Modelling, to reduce the dimensionality of data. This enables additional search functionality. Term similarity provides an insight into the content of the data. Whole documents can be used as search terms, so similar documents can be found
Quantitatitive Text Analysis
Helix provides analytics for word count by document or by repository. It also allows the comparison of fields within two repositories.
Helix provides url and postcode extraction. Helix enriches postcodes and place names with latitude and longitude, enabling spatial search of data.
Research papers about Human Gut Biome, processed by Helix and the extracted text exported to a graph database for visualisation