United Formula Annotation

A molecular formula describes the elemental composition of a chemical compound and is one of its fundamental properties. Assigning molecular formulas to peaks in untargeted LC/HRMS data can help extract biological insights from metabolomics and exposomics datasets. Formula assignment complements peak annotation pipelines that require MS2 spectra to assign a structural identity to a peak, because formulas can be assigned using MS1 spectral data alone, which are available for every sample analyzed on an LC/HRMS instrument in a metabolomics or exposomics study.

Because each element has naturally occurring isotopes, MS1 spectral data contain more than one mass-to-charge ratio (m/z) value for each ionized species. The isotopic pattern of a compound can be accurately predicted using a set of combinatorial rules and the atomic mass tables provided by the International Union of Pure and Applied Chemistry (IUPAC). To assign a molecular formula, the theoretical isotopic profile of a carbon-containing compound can be queried against the MS1 spectral data using a set of matching criteria and a scoring system. Because molecular formula assignment is so universally applicable, almost all commercial and academic software for processing untargeted LC/HRMS datasets can search a single molecular formula, or a list of them, against the raw MS1 data. Community guidelines for peak annotation also recommend performing a molecular formula assignment step on untargeted LC/HRMS datasets.
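To make the isotopic profile prediction concrete, the following is a minimal Python sketch that builds a theoretical profile by repeatedly convolving per-element isotope distributions. It is an illustration only and not the IDSL.UFA implementation: the function names (`convolve`, `isotopic_profile`, `nominal_envelope`) are hypothetical, and the isotope masses and abundances are rounded values typed in by hand for four elements, whereas a real workflow would load the full IUPAC table.

```python
from collections import defaultdict

# Assumption: rounded (mass in Da, natural abundance) pairs entered by hand
# for illustration; replace with the full IUPAC table for real work.
ISOTOPES = {
    "C": [(12.0, 0.9893), (13.0033548, 0.0107)],
    "H": [(1.0078250, 0.999885), (2.0141018, 0.000115)],
    "N": [(14.0030740, 0.99636), (15.0001089, 0.00364)],
    "O": [(15.9949146, 0.99757), (16.9991317, 0.00038), (17.9991610, 0.00205)],
}


def convolve(dist_a, dist_b, decimals=6, min_abundance=1e-9):
    """Combine two (mass, probability) distributions.

    Peaks whose summed masses agree to `decimals` places are merged, and
    peaks below `min_abundance` are dropped to keep the list small.
    """
    merged = defaultdict(float)
    for mass_a, p_a in dist_a:
        for mass_b, p_b in dist_b:
            merged[round(mass_a + mass_b, decimals)] += p_a * p_b
    return [(m, p) for m, p in sorted(merged.items()) if p >= min_abundance]


def isotopic_profile(counts):
    """Return the (mass, probability) list for a neutral formula such as
    {"C": 6, "H": 12, "O": 6}, adding one atom at a time."""
    profile = [(0.0, 1.0)]
    for element, n_atoms in counts.items():
        for _ in range(n_atoms):
            profile = convolve(profile, ISOTOPES[element])
    return profile


def nominal_envelope(profile):
    """Sum probabilities per nominal (integer) mass and scale the largest
    bin to 100. Rounding to the nearest integer is only safe for small
    molecules whose mass defect stays well below 0.5 Da."""
    bins = defaultdict(float)
    for mass, p in profile:
        bins[round(mass)] += p
    base = max(bins.values())
    return {nominal: 100.0 * p / base for nominal, p in sorted(bins.items())}


if __name__ == "__main__":
    glucose = {"C": 6, "H": 12, "O": 6}
    profile = isotopic_profile(glucose)
    print("lightest isotopologue:", profile[0])
    print("nominal envelope:", nominal_envelope(profile))
```

The atom-by-atom convolution is chosen here for readability rather than speed. The profile describes the neutral molecule, so an adduct mass (for example, a proton) would still have to be added before comparing against observed m/z values.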

While existing solutions offer a straightforward way to match theoretical isotopic patterns against MS1 spectral data, there is still an unmet need to improve the workflow for larger studies and for diverse sources of molecular formulas. This is particularly important for exposomics studies, where we expect to see many more compounds from formula sources other than common metabolite databases.

Therefore, the Integrated Data Science Laboratory for Metabolomics and Exposomics (IDSL.ME) at the Icahn School of Medicine at Mount Sinai has developed a new software package, United Formula Annotation (UFA). The software, IDSL.UFA, can:

  1. Enumerate molecular formulas for a single m/z using filtering rules or a subset sum approach

  2. Compute theoretical isotopic profiles using the atomic mass data provided by IUPAC

  3. Match molecular formula hits against individual peak lists and assign a rank to each hit

  4. Refine the ranking on the aligned peak table

  5. Scale well to larger studies (n > 500) using multi-threaded processing
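As an illustration of the first step in the list above, the sketch below enumerates candidate CHNO formulas for a single neutral monoisotopic mass within a ppm tolerance. It is a hedged example, not IDSL.UFA's algorithm or rule set: the function name `enumerate_formulas`, the element bounds, and the two filters (a maximum hydrogen count of 2C + N + 2 and an even H + N parity for neutral molecules) are assumptions chosen for brevity, and the monoisotopic masses are typed in by hand.

```python
# Assumption: hand-entered monoisotopic masses (Da) for four elements only.
MONO = {
    "C": 12.0,
    "H": 1.0078250319,
    "N": 14.0030740052,
    "O": 15.9949146221,
}


def enumerate_formulas(target_mass, ppm=5.0, max_counts=None):
    """Return (formula, error_ppm) pairs for neutral CHNO formulas whose
    monoisotopic mass lies within `ppm` of `target_mass`, best match first.

    C, N and O counts are looped over; the H count is then solved directly,
    because at ppm-level tolerances only the nearest integer can fit.
    """
    max_counts = max_counts or {"C": 40, "N": 6, "O": 12}  # hypothetical bounds
    tol = target_mass * ppm * 1e-6
    hits = []
    for c in range(1, max_counts["C"] + 1):
        mass_c = c * MONO["C"]
        if mass_c > target_mass + tol:
            break
        for n in range(max_counts["N"] + 1):
            mass_cn = mass_c + n * MONO["N"]
            if mass_cn > target_mass + tol:
                break
            for o in range(max_counts["O"] + 1):
                mass_cno = mass_cn + o * MONO["O"]
                if mass_cno > target_mass + tol:
                    break
                h = round((target_mass - mass_cno) / MONO["H"])
                if h < 0:
                    continue
                # Illustrative filter 1: hydrogen count cannot exceed the
                # saturation limit for a CHNO skeleton.
                if h > 2 * c + n + 2:
                    continue
                # Illustrative filter 2: neutral closed-shell CHNO molecules
                # have an even H + N count.
                if (h + n) % 2:
                    continue
                mass = mass_cno + h * MONO["H"]
                error_ppm = (mass - target_mass) / target_mass * 1e6
                if abs(error_ppm) <= ppm:
                    hits.append(({"C": c, "H": h, "N": n, "O": o}, error_ppm))
    return sorted(hits, key=lambda hit: abs(hit[1]))


if __name__ == "__main__":
    # Neutral monoisotopic mass of glucose, C6H12O6.
    for formula, error_ppm in enumerate_formulas(180.06339):
        print(formula, round(error_ppm, 2))
```

An observed m/z would first need to be converted to a neutral mass by removing the assumed adduct. In a fuller workflow, each surviving candidate would then be scored against the observed isotopic pattern rather than ranked by mass error alone.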

The software is provided as a standard R package ( ) and has been rigorously tested on several large datasets of authentic standards and on untargeted LC/HRMS datasets of human specimens. We found that IDSL.UFA can greatly expand the scope of new discoveries from untargeted LC/HRMS datasets. A user-friendly parameter file is provided in Excel format to minimize R scripting effort and to enable seamless integration of IDSL.UFA into other peak annotation pipelines.