Difference and classification of datasets (2023-09-23 11:18:53)

In the previous post, we mentioned that datasets are a collection of data. When we discuss collections, we conclude from this definition that the type of data itself can determine the classification of the dataset to a large extent. One of the benefits of this separation is adding new data to the relevant category. Every library or institution (Google Dataset Search, Kaggle, Amazon, Github, etc.) that has been active in this field has provided different categories, but in this article we want to review some of the most important ones.

این چیه؟

 

# Text

These types of input sets are language processing machines (NLP) and as their name suggests, they are a collection of texts and documents that can be a wide range of news, articles, books, site feeds, dictionaries, and include such cases. One of the largest and most famous datasets in this category is the Wikipedia collection, which is used in the preparation of Word net (which we will have an article about later).

 

# Vision

Since it has more tangible results than the others and there are more resources for preparing images, working with this type of dataset is probably the most attractive for especially beginners. Libraries such as TorchVision have prepared small datasets such as FashionMNIST (a collection of clothing images in 10 different categories) and CIFAR10 (a collection of animal and vehicle images in 10 different categories) for these people. Also, for those who plan to do bigger projects, they either use Kaggle datasets and the like, or they prepare the data they need themselves.

 

# Audio

Voice recognition and extracting text from them are two examples of the wide applications of this type of data set. Conversations in movies, animations, news, radio programs and speeches are the best sources of collecting this type of data. One of the examples available in the Kaggle resources, named Speaker Recognition Audio Dataset, is a collection of 50 speeches of more than one hour.

 

# Numerical

A set of statistical and mathematical information is included in these categories. If we want to better understand the concept of these data sets, they can be considered as tables of numbers, each of which shows a value of a specific feature. For example, if we want to have the location of houses in a city, we can have their latitude and longitude values ​​in a table. Among the most knowledgeable of this type of data sets are the collections of the Scikit-learn library, which can be presented as colored points in different charts.

 

# Classified

Another simple set for small tasks, especially Binary Classification, is data that is characterized by a value of Zero and One or False and True or some thing like them. For a better understanding, the example of Male and Female members of society can be used as an example. In the meantime, you can use a value of 1 for being a woman and 0 for being a man.

 

Labeled and UnLabeled

From another point of view, information can be categorized based on whether or not they have a specific purpose or label. Based on this, if a special name is considered for each sample of the data sets, or it is placed in a special folder or directory to be separated from other data, or in numerical cases, a separate column is considered to display the output of that sample, will be recognized as labeled datasets. Otherwise, they will be unlabeled. In some cases, we can also see consolidated information that solves this problem with methods such as using labels for training and without labels for testing.

 


Comments


Inchieh is both an idea and a learning challenge for us (and of course for anyone who wants to be with us). The purpose of the idea can be seen from its name. You are supposed to say what each picture is. فارسی