Futureweb AI - AI Solutions for Your Business

Working with Hugging Face Datasets: A Guide to Efficient Data Handling for Machine Learning

By Aryan October 13, 2024 8 min read


Hugging Face’s datasets library is a Python library for loading and processing datasets. It was designed with natural language processing (NLP) in mind, although it’s versatile enough to be used more broadly.

This library provides easy-to-use methods to download, cache, and process datasets. It’s built on top of Apache Arrow, so when you work with a dataset through the library, your data is stored under the hood in Arrow’s efficient columnar in-memory format.

The key advantage of Hugging Face datasets is ease of use for machine learning and NLP tasks: it gives access to a large repository of existing datasets, along with efficient data manipulation tools.

Example 1: Loading a Hugging Face Dataset

Loading a dataset from Hugging Face is pretty straightforward. All we have to do is install the datasets library, if not already done, and call load_dataset with the name of the dataset and the split we want.

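# In a notebook cell; from a shell, drop the leading "!"
!pip install datasets

from datasets import load_dataset

# Download (and cache) the training split of the Rotten Tomatoes dataset
dataset = load_dataset("rotten_tomatoes", split="train")

# Check the first element of the dataset
print(dataset[0])

# Slice the dataset to get the first 5 elements
print(dataset[:5])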

Example 2: Create a Dataset From a DataFrame

In day-to-day Python work we often use pandas DataFrames, since they make spreadsheet-like data easy to work with. Here is an example of how to convert a DataFrame into a Dataset.

import pandas as pd

from datasets import Dataset

# Example DataFrame

data = {'Column1': [1, 2, 3], 'Column2': ['a', 'b', 'c']}

df = pd.DataFrame(data)

# Convert to Dataset

dataset = Dataset.from_pandas(df)


About the Author

Aryan