In topic 2, we introduced the concepts of structured (i.e., highly organized) and unstructured (i.e., not organized) data. Structured data is what you would see in an Excel spreadsheet where there are column headings and rows with many observations. Unstructured data tends to be in a text format, like a transcript from a company's earnings call.
Categorical data sort items into groups that are labeled with words, such as classifying a group of people by gender (male, female, nonbinary), labeling transaction types (sales versus returns), or identifying an inventory costing method (FIFO, LIFO, average cost).
Numerical data are meaningful numbers such as transaction amount, net income, age, or the score on an exam.
Within categorical data there are two subsets:
Nominal data, and
ordinal data.
Nominal data cannot be ranked (e.g., gender or transaction type). These data are usually summarized using counts or proportions (e.g., what proportion of participants are female?).
In the example above, Transaction_Type is a Categorical-Nominal data type.
Ordinal data has natural, ordered, or ranked categories. It can be summarized by counting and grouping, taking a proportion, or ranking.
In the example above, Date is a Categorical-Ordinal data type. Each day in January 2021 is a separate category, and the categories are ordered: January 2nd comes after January 1st and so on.
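As a rough illustration of these summaries, the sketch below counts and takes proportions of a nominal field and then lists the categories of an ordinal field in their natural order. The four transactions are hypothetical data made up for this sketch (they are not the table referred to in the text); only the field names Transaction_Type and Date come from the text.

```python
from collections import Counter
from datetime import date

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"Date": date(2021, 1, 2), "Transaction_Type": "Sale"},
    {"Date": date(2021, 1, 1), "Transaction_Type": "Sale"},
    {"Date": date(2021, 1, 2), "Transaction_Type": "Return"},
    {"Date": date(2021, 1, 3), "Transaction_Type": "Sale"},
]

# Nominal data: count each category, then take its share of the total.
type_counts = Counter(t["Transaction_Type"] for t in transactions)
type_proportions = {
    category: count / len(transactions)
    for category, count in type_counts.items()
}

# Ordinal data: count each category, then put the categories in order.
date_counts = Counter(t["Date"] for t in transactions)
ordered_dates = sorted(date_counts)

print(dict(type_counts))
print(type_proportions)
for d in ordered_dates:
    print(d.isoformat(), date_counts[d])
```

Sorting is meaningful for Date because its categories have a natural order. Sorting Transaction_Type would only produce an alphabetical listing, which says nothing about rank.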
There are four primary methods to summarize numerical data:
Counting and grouping,
proportion,
summing, and
averaging.
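The four methods above can be sketched in a few lines of Python. The transaction amounts and the 90-dollar cutoff used for the proportion are hypothetical values chosen for illustration, not figures from the text.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (transaction type, amount) pairs, invented for illustration.
transactions = [
    ("Sale", 120.0),
    ("Sale", 80.0),
    ("Return", 20.0),
    ("Sale", 100.0),
]
amounts = [amount for _, amount in transactions]

# 1. Counting and grouping: how many transactions fall in each group?
amounts_by_type = defaultdict(list)
for transaction_type, amount in transactions:
    amounts_by_type[transaction_type].append(amount)
counts = {t: len(values) for t, values in amounts_by_type.items()}

# 2. Proportion: what share of transactions exceed a (hypothetical) cutoff?
cutoff = 90.0
share_above_cutoff = sum(amount > cutoff for amount in amounts) / len(amounts)

# 3. Summing: total of all amounts.
total = sum(amounts)

# 4. Averaging: mean amount per transaction.
average = mean(amounts)

print(counts, share_above_cutoff, total, average)
```

Summing and averaging are what set numerical data apart here: adding up transaction amounts produces a meaningful total, while adding up category labels does not.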
Interval data is measured along a scale with equal spacing between values but no true zero point (e.g., temperature in degrees Celsius or Fahrenheit, or SAT scores). Ratio data, on the other hand, is numerical data with a true zero, so there is an equal and definitive ratio between data points (e.g., transaction amounts, expenses, revenues, assets, salary, taxes).
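One way to see the difference is to check whether a ratio survives a change of units. The sketch below is a minimal illustration using made-up values: the ratio between two temperatures changes when Celsius is converted to Fahrenheit, while the ratio between two dollar amounts stays the same when dollars are converted to cents.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval data: hypothetical temperatures in degrees Celsius.
temp_low_c, temp_high_c = 10.0, 20.0
ratio_in_celsius = temp_high_c / temp_low_c
ratio_in_fahrenheit = (
    celsius_to_fahrenheit(temp_high_c) / celsius_to_fahrenheit(temp_low_c)
)

# Ratio data: hypothetical transaction amounts in dollars.
amount_low_usd, amount_high_usd = 10.0, 20.0
ratio_in_dollars = amount_high_usd / amount_low_usd
ratio_in_cents = (amount_high_usd * 100) / (amount_low_usd * 100)

# The temperature ratio depends on the unit, so "twice as hot" is not
# a meaningful statement; the dollar ratio does not depend on the unit.
print(ratio_in_celsius, ratio_in_fahrenheit)
print(ratio_in_dollars, ratio_in_cents)
```

Differences remain meaningful for interval data (20 degrees is 10 degrees warmer than 10 degrees), but only ratio data supports statements such as "this transaction is twice as large as that one."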