AI Models Eat Data

Our Neuro-Symbolic AI Models are Designed to be Lean on Data

Prescription Datasets from Health-PIE

This dataset is made up of prescription images and their content in structured format as Electronic Medical Records (EMR), duly verified by clinicians. The structured data has a very detailed schema; for example, the attributes for each prescribed medication are: medication name, code, mode of administration, dosage or quantity, frequency, duration, and whether to take it before or after food. The majority of prescription images are handwritten notes, while a few have printed content. The top specialties are: General Practice, Pediatrics, Gynecology, Neurology, Cardiology, Orthopedics, and Gastroenterology. These records come from various departments such as Out-Patient, In-Patient (discharge summaries), Emergency, and Pathology/Radiology (investigation reports).
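As an illustration, a single structured medication item under a schema like the one described above could be modeled as follows. The field names and values here are hypothetical stand-ins; the actual Health-PIE schema is proprietary.

```python
from dataclasses import dataclass

# Hypothetical sketch of one structured medication item from an EMR record.
# Field names and sample values are illustrative only.
@dataclass
class MedicationItem:
    med_name: str     # e.g. "Paracetamol"
    code: str         # a standard drug code
    mode: str         # mode of administration, e.g. "oral"
    dosage: str       # dosage or quantity, e.g. "500 mg"
    frequency: str    # e.g. "twice daily"
    duration: str     # e.g. "5 days"
    after_food: bool  # whether to take after food

item = MedicationItem("Paracetamol", "N02BE01", "oral", "500 mg",
                      "twice daily", "5 days", True)
print(item.med_name, item.dosage)
```

Each prescription image in the dataset is paired with a list of such structured items, one per attribute group (medications, vitals, diagnoses, and so on).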

The total number of records is over 1 million, from over 0.6 million individuals. There are over 8 million individual structured items from these EMRs. The EMR captures important details such as vitals, medications, diagnoses, diagnostics advised, symptoms, procedures done, diagnostics done, follow-ups, and referrals. Records belonging to the same individual are linked together, providing a view of the journey from out-patient to in-patient to discharge. Individuals related to each other are also linked.

Source: Generated from Health-PIE product (Proprietary).


For Acoustic Model

The STT dataset is a combination of our proprietary medical recordings dataset taken from Health-PIE and newly generated sentences.

The medical recordings dataset comprises millions of text samples and their corresponding Indian-accent recordings in the English language. This dataset is one of a kind, as it has a wide variety in medical terminology as well as in general English sentences (common descriptive sentences and phrases).

The entire STT dataset contains 10 million samples spanning 10,000+ hours of recorded audio. It is split into three sets: training (8.2 million samples), validation (1 million samples), and test (1 million samples).

Source: Created by consultants (Proprietary).

For LM

The dataset for the LM covers the kinds of sentences spoken in doctor-patient clinical and general conversations, as well as text produced while taking notes: doctor templates, discharge summaries, and emergency notes. The dataset has many different categories for each type of text.

The dataset consists of general spoken sentences as well as sentences using medical terminology. It grows very large because entities such as medicine names and lab scans can appear in many permutations.

The general spoken sentences number around 5 million in the whole corpus. Apart from this, the medical data is classified into different categories of sentences, such as medication, diagnosis, symptoms, lab scans, and vital notes. The major contribution of medical sentences comes from the lab-scan, medication, and symptom categories, as these have many variable factors and many templates. The total number of sentences on which the language model is trained comes to around 1.71 × 10²¹, a large dataset with enough variety to provide better results.
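To see how template-and-entity permutations drive the sentence count up, consider a toy calculation. All the counts below are made up for illustration and are not the actual corpus statistics:

```python
# Toy illustration of combinatorial growth: sentence templates whose slots are
# filled with interchangeable entities (medicines, dosages, frequencies, ...).
# Every count here is illustrative, not a real corpus statistic.
templates = 1_000      # sentence templates in one category
medicines = 50_000     # distinct medicine names
dosages = 100          # distinct dosage phrasings
frequencies = 20       # distinct frequency phrasings

# Each template with one medicine slot, one dosage slot, one frequency slot:
combinations = templates * medicines * dosages * frequencies
print(combinations)
```

Even this small example yields 10¹¹ distinct sentences from a single slot pattern, which is why categories with many variable entities dominate the total.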

Source: Health-PIE (Proprietary) and Open-source texts.


We have generated a large tagged dataset for centoms. The raw medical dataset was collected from hundreds of medical question-answer pairs belonging to various specialties. This text is used to manually tag centoms.

We have generated over 1 million centoms.

Source: Created by consultants (Proprietary).


The co-referencing dataset comprises two different datasets: a Hint Generation Dataset and a Dereferencing Dataset.

The hint generation dataset is made up of 30k sentences, each annotated at the syntactic level with singularity, gender, and living/non-living hints. It is split into two sets: training (27k) and test (3k).
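For illustration, one annotated sentence in such a hint dataset might carry per-token hints like this. The annotation format below is hypothetical; the proprietary dataset's actual scheme may differ.

```python
# Hypothetical example of syntactic-level hints attached to a sentence.
# The real annotation scheme of the proprietary dataset may differ.
example = {
    "sentence": "The doctor examined the patient and she prescribed a tablet.",
    "hints": [
        {"token": "she", "number": "singular",
         "gender": "female", "animacy": "living"},
        {"token": "tablet", "number": "singular",
         "gender": "neutral", "animacy": "non-living"},
    ],
}
print(len(example["hints"]))
```

Hints of this kind constrain which antecedents a later dereferencing step may consider for each mention.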

The dereferencing dataset contains a large number of sentences that use the hints (generated at the syntactic level) at the semantic level.

Source: Created by consultants (Proprietary).

Universal NLU

The Universal NLU datasets are a variety of datasets manually tagged by our teams. They build on the centom dataset, which is further tagged for various features and classes. These datasets are used to train models for Universal NLU and thought ecosystems.

The dataset size is 70 thousand sentences.

Clinical Datasets

This dataset includes information such as demographics, vital sign measurements, patient complaints, laboratory test advice and reports, procedures, medications, home remedies, and follow-up details. When you go to the doctor, a lot of data is collected, stored, processed, and analysed. The majority of the data involves numbers such as heart rate, temperature, and blood pressure. Diagnostic information includes blood tests, culture tests, and imaging reports such as X-rays. Treatment information includes which medicine to take, how often, and in what quantity. Creating notes for all of this is one of the most burdensome aspects of a doctor's work.

We built a system that uses voice to create medical notes in a way that mirrors the regular clinical workflow. The generated output must follow the doctor's domain of expertise. The medical terminology used by doctors must be converted into codes defined by international standards for medical terms, such as ICD and SNOMED CT. Each term a doctor uses is a concept with its own code in a given vocabulary, and the number of available vocabulary resources is enormous. For this we used UMLS, which comprises 1 million biomedical concepts and more than 5 million concept names, categorized into 127 semantic types related by 54 semantic relationships.
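A minimal sketch of mapping spoken terms to standard concept codes is shown below. The tiny dictionary is a stand-in for a real UMLS/SNOMED CT lookup service, and the function name is hypothetical; a production system would query the actual vocabulary resources.

```python
# Hypothetical sketch: normalize a spoken term and look up its concept code.
# The dictionary stands in for a real UMLS/SNOMED CT lookup; CUIs shown are
# illustrative examples of the concept-identifier format.
CONCEPT_CODES = {
    "myocardial infarction": ("UMLS", "C0027051"),
    "heart attack": ("UMLS", "C0027051"),   # synonym, same concept
    "hypertension": ("UMLS", "C0020538"),
}

def code_for(term):
    """Return (vocabulary, code) for a term, or None if unknown."""
    return CONCEPT_CODES.get(term.strip().lower())

print(code_for("Heart Attack"))
```

The key property this illustrates is that many surface forms ("heart attack", "myocardial infarction") normalize to a single concept identifier, which is what makes coded notes interoperable.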

Source: Created by clinical team (Proprietary).

Video and Imaging Datasets

This dataset consists of labeled images of people pointing towards their body parts. It was created as part of a Real-Time Visual Context Dereferencing project. The images are sampled from videos in which people gesture towards body parts such as the head, chest, and eyes. The dataset contains 70,000+ images covering 60+ different postures.

Source: Created by consultants (Proprietary).

Contact Now to Experience AI for Healthcare

Free Trial

Connect with us now for a free trial for 30 days

  • Start saving up to 3 hours daily.
  • Focus on your patient and let Simbo take care of the rest.
  • Experience peace of mind with hassle-free documentation.
  • Secure, Private and HIPAA compliant AI technology.

Simbo, Inc., 867 Boylston St., Boston, MA 02116, USA. 800-9944

Contact us

Simbo is making doctors' and patients' lives better. Connect now to get started.