IBM Reveals its Entire 6.48 TB LLM Training Dataset

IBM's Armand Ruiz has revealed the complete 6.48 TB dataset used to train Granite 13B.

In May, IBM open-sourced its Granite 13B LLM, a model built for enterprise use cases.

Now, Armand Ruiz, the VP of Product – AI Platform at IBM, has revealed the complete 6.48 TB dataset used to train Granite 13B.

This dataset, after undergoing rigorous pre-processing, was reduced to 2.07 TB, reflecting a 68% reduction. Ruiz emphasised that this step was essential to ensure a high-quality, unbiased, ethical, and legal dataset tailored for enterprise use cases.
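
The arithmetic behind that figure checks out; a quick one-liner in Python confirms the stated reduction.

```python
raw_tb, cleaned_tb = 6.48, 2.07
print(f"{(raw_tb - cleaned_tb) / raw_tb:.0%} reduction")  # -> 68% reduction
```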

The dataset was meticulously curated from a variety of sources, including:

  • arXiv: Over 2.4 million scientific paper pre-prints.
  • Common Crawl: Open repository of web crawl data.
  • DeepMind Mathematics: Mathematical Q&A pairs.
  • Free Law: Public-domain legal opinions from US courts.
  • GitHub Clean: Code data from CodeParrot.
  • Hacker News: Computer science and entrepreneurship news from 2007-2018.
  • OpenWebText: Open-source recreation of OpenAI’s WebText corpus.
  • Project Gutenberg (PG-19): Free e-books with a focus on older works.
  • PubMed Central: Biomedical and life sciences papers.
  • SEC Filings: 10-K/Q filings from the US SEC (1934-2022).
  • Stack Exchange: User-contributed content on the Stack Exchange network.
  • USPTO: US patents granted from 1975 to May 2023.
  • Webhose: Unstructured web content converted into machine-readable data.
  • Wikimedia: Eight English Wikimedia projects.
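
Many of these sources have public counterparts on the Hugging Face Hub. As a rough illustration, the sketch below streams two stand-ins and mixes them by weight; the dataset IDs, column names, and 70/30 split are assumptions for demonstration, not IBM's actual data mixture.

```python
import random
from datasets import load_dataset  # pip install datasets

# Hypothetical stand-ins for two of the sources listed above; the IDs,
# column names, and sampling weights are assumptions, not IBM's recipe.
web = iter(load_dataset("Skylion007/openwebtext", split="train", streaming=True))
code = iter(load_dataset("codeparrot/github-code-clean", split="train",
                         streaming=True, trust_remote_code=True))

random.seed(42)
for _ in range(5):
    # Draw each document from a source chosen by weight (70% web, 30% code)
    if random.random() < 0.7:
        text = next(web)["text"]    # OpenWebText exposes a "text" column
    else:
        text = next(code)["code"]   # github-code-clean exposes a "code" column
    print(text[:80].replace("\n", " "))
```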

The pre-processing pipeline included several key steps:

  • Text extraction
  • Deduplication
  • Language identification
  • Sentence splitting
  • Hate, abuse, and profanity annotation
  • Document quality annotation
  • URL block-listing annotation
  • Filtering
  • Tokenization

These steps, involving annotation and filtering based on defined thresholds, ensured that the final dataset was of the highest quality for model training.
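
As a concrete illustration of how a subset of these steps fits together per document, here is a minimal sketch in Python. The helper functions, thresholds, and block list are placeholder assumptions; the production pipeline IBM describes runs at corpus scale with trained classifiers rather than the toy heuristics used here.

```python
import hashlib

# Hypothetical per-document pipeline mirroring some of the steps above.
# Thresholds, helper names, and scoring are illustrative assumptions.
BLOCKED_URLS = {"spam.example.com"}     # stand-in for a real URL block list
QUALITY_THRESHOLD = 0.5                 # assumed filtering threshold

def quality_score(text: str) -> float:
    """Toy proxy for a document-quality classifier (real pipelines use models)."""
    return min(1.0, len(text.split()) / 100)  # longer docs score higher, capped at 1

def is_english(text: str) -> bool:
    """Placeholder language ID; production systems use trained classifiers."""
    return all(ord(c) < 128 for c in text)    # crude ASCII heuristic

def preprocess(docs):
    seen_hashes = set()
    for doc in docs:
        text, url = doc["text"], doc.get("url", "")
        # 1. Deduplication: drop exact duplicates by content hash
        h = hashlib.sha256(text.encode()).hexdigest()
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        # 2. Language identification: keep English-only documents
        if not is_english(text):
            continue
        # 3. URL block-listing annotation plus filtering
        if any(blocked in url for blocked in BLOCKED_URLS):
            continue
        # 4. Document quality annotation plus threshold filtering
        if quality_score(text) < QUALITY_THRESHOLD:
            continue
        yield text                       # surviving documents go on to tokenization

docs = [{"text": "An example enterprise document. " * 20, "url": "ibm.com"}]
print(sum(1 for _ in preprocess(docs)), "document(s) survived filtering")
```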

IBM has released four variants of its Granite code models, ranging from 3 billion to 34 billion parameters. The models have been tested on a range of benchmarks and have outperformed comparable models such as Code Llama and Llama 3 on many tasks.
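
The open Granite code models are published on the Hugging Face Hub. Assuming the ibm-granite/granite-3b-code-base checkpoint ID for the smallest variant (worth verifying on the Hub), a minimal generation sketch with the transformers library might look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint ID for the smallest Granite code model; verify on the Hub.
model_id = "ibm-granite/granite-3b-code-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```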

Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words.