AI Models Eat Data

Our Neuro-Symbolic AI Models are Designed to be Lean on Data


For Acoustic Model

The STT dataset combines our proprietary medical recordings dataset, derived from medical records, with newly generated sentences.

The medical recordings dataset comprises millions of texts and their corresponding Indian-accent English recordings. This dataset is one of its kind, covering a wide variety of medical terminology as well as general English sentences (common descriptive sentences and phrases).

The entire STT dataset contains 10 million samples spanning 10,000+ hours of recorded audio. It is split into three sets: training (8.2 million samples), validation (1 million samples) and test (1 million samples).

Source: Created by consultants (Proprietary).

For LM

The dataset for the LM contains all kinds of sentences that can occur in doctor-patient clinical and general conversation or while taking notes: doctor templates, discharge summaries, emergency notes and general conversation. The dataset has many different categories for each type of text.

The dataset consists of general spoken sentences and sentences that use medical terms. The dataset becomes huge because entities such as medicine names and lab scans can be spoken in many permutations.

The general spoken sentences number around 5 million in the whole corpus. Beyond this, the medical data is classified into different categories of sentences, such as medication, diagnosis, symptoms, lab scans and vital notes. The major contributions to the medical sentences come from lab scans, medication and symptoms, as these have many variable factors and many templates. The total number of sentences on which the language model is trained comes to around 1.71 × 10²¹, a vast amount of data with great variety that yields better results.
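As a rough illustration of why templated sentence counts explode, each slot in a template multiplies the total. The template and entity counts below are made-up placeholders, not our actual dataset sizes:

```python
# Sketch: why templated medical sentences multiply into a huge corpus.
# All counts below are illustrative placeholders, not real dataset sizes.

entity_counts = {
    "medication": 50_000,   # hypothetical number of medicine names
    "dosage": 200,          # hypothetical dosage phrasings
    "frequency": 30,        # "twice a day", "every 6 hours", ...
}

templates = 1_000  # hypothetical medication-sentence templates

# A template with one slot per entity type yields the product of slot sizes.
sentences_per_template = 1
for count in entity_counts.values():
    sentences_per_template *= count

total = templates * sentences_per_template
print(f"{total:.2e} distinct medication sentences")  # 3.00e+11
```

Even with these modest placeholder counts, a single sentence category reaches hundreds of billions of combinations, which is why the full corpus grows to the order of 10²¹.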

Source: Medical Records (Proprietary) and Open-source texts.


There is a huge tagged dataset we generated for centoms. A raw medical dataset was collected from hundreds of medical question-answer pairs belonging to various specialties. This text is used to manually tag centoms.

We have generated over 1 million centoms.

Source: Created by consultants (Proprietary).


The co-referencing dataset comprises two different datasets: a Hint Generation Dataset and a Dereferencing Dataset.

The hint generation dataset contains 30k sentences, each carrying singularity, gender and living/non-living hints at the syntactic level. It is split into two sets: training (27k) and test (3k).

The dereferencing dataset contains a large number of sentences that then use those syntactic-level hints at the semantic level.
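A minimal sketch of how syntactic hints (singularity, gender, living/non-living) can narrow down pronoun dereferencing at the semantic level. The feature schema and names here are illustrative, not our production format:

```python
# Sketch: using syntactic-level hints to dereference a pronoun.
# The hint schema below is an illustrative stand-in only.

# Hints attached to each candidate antecedent in the sentence.
candidates = [
    {"text": "Dr. Mehta",   "singular": True,  "gender": "f",  "living": True},
    {"text": "the reports", "singular": False, "gender": None, "living": False},
    {"text": "the patient", "singular": True,  "gender": "m",  "living": True},
]

# Hints for the pronoun to be resolved, e.g. "she" in "she reviewed them".
pronoun = {"singular": True, "gender": "f", "living": True}

def compatible(cand, pron):
    """A candidate must agree on every hint the pronoun specifies."""
    return all(
        pron[k] is None or cand[k] == pron[k]
        for k in ("singular", "gender", "living")
    )

matches = [c["text"] for c in candidates if compatible(c, pronoun)]
print(matches)  # ['Dr. Mehta']
```

The syntactic hints alone eliminate incompatible candidates; semantic-level dereferencing then chooses among whatever remains.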

Source: Created by consultants (Proprietary).

Universal NLU

Universal NLU datasets are a variety of datasets manually tagged by our teams. They are essentially the centom dataset tagged further for various features and classes, and are used to train models for Universal NLU and thought ecosystems.

The dataset size is 70,000 sentences.

Clinical Datasets

This dataset includes information such as demographics, vital sign measurements, patient complaints, laboratory test advice and reports, procedures, medications, home remedies and follow-up details. When you go to the doctor, a lot of data is collected, stored, processed and analysed. Most of the data involves numbers such as heart rate, temperature and blood pressure. Diagnostic information includes blood tests, culture tests and imaging reports such as X-rays. Treatment information includes which medicine to take, how often and in what quantity. Creating notes for all of this is one of the most burdensome parts of a doctor's work.

We built a system that uses voice to create medical notes in a way that mirrors the doctor's regular workflow. The generated output must match the doctor's domain of expertise. Medical terminology used by doctors must be converted into codes from international standards for medical terms such as ICD and SNOMED CT. Each term a doctor uses is a concept with its own code in its vocabulary, and the number of available vocabulary resources is enormous. For this we used UMLS, which comprises 1 million biomedical concepts and more than 5 million concept names, categorized into 127 semantic types that are further related by 54 semantic relationships.
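A minimal sketch of the normalization step: mapping a doctor's free-text term to a standard concept code via a lookup table. The tiny table below is a hand-made stand-in for the full UMLS Metathesaurus, and the codes shown are for illustration only; in practice the lookup would query the full vocabulary resources:

```python
# Sketch: normalizing free-text clinical terms to standard concept codes.
# The lookup table is a tiny illustrative stand-in for UMLS; real systems
# query the full Metathesaurus with millions of concept names.

concept_table = {
    # several surface forms (concept names) can map to one concept code
    "fever": "C0015967",
    "pyrexia": "C0015967",
    "high temperature": "C0015967",
    "hypertension": "C0020538",
    "high blood pressure": "C0020538",
}

def normalize(term):
    """Return the concept code for a term, or None if it is unknown."""
    return concept_table.get(term.strip().lower())

print(normalize("Pyrexia"))              # C0015967
print(normalize("High Blood Pressure"))  # C0020538
```

The many-to-one mapping is the key property: however the doctor phrases a finding, the note carries a single standardized code downstream.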

Source: Created by clinical team (Proprietary).

Video and Imaging Datasets

This dataset consists of labeled images containing people pointing towards their body parts. It is created as a part of a Real-Time Visual Context De-referencing project. These images are generated from specifically sampled videos in which people are showing various gestures towards their body parts such as head, chest, eyes etc. The dataset contains 70,000+ images containing 60+ different postures.

Source: Created by consultants (Proprietary).

Contact Now to Experience AI for Healthcare

Free Trial

Connect with us now for a free trial for 30 days

  • Start saving up to 3 hrs daily.
  • Focus on your patient and let Simbo take care of the rest.
  • Experience peace of mind with hassle-free documentation.
  • Secure, private and HIPAA-compliant AI technology.

Simbo, Inc. 45 Prospect St., Cambridge, MA 02116, USA 800-9944

Contact us

Simbo is making doctors' and patients' lives better. Connect now to get started.