How do we store data? Each source of data has a storage system suited to that type of data. A source’s storage system is usually optimized to store that data reliably, not to support data analysis or extraction. The distinction is similar to the one between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP).
Challenges of Source Storage Systems (not optimized for analysis)
- hold more information than is necessary, relevant, or useful for analytics;
- can be essential to daily operations, so it is advisable to avoid making changes to them during analysis, as this could cause the system to slow down or crash;
- typically archive data only for the short term.
Hence, it is ideal to have a separate storage location for the data to be analysed. It can be virtual or physical; the main point is that it is accessible to its users. It could also be a combination of the two, or a semi-centralised repository.
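To make the idea of a separate analysis store concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`orders`, `order_facts`), the columns, and the sample rows are all hypothetical, and both databases live in memory purely for illustration. In practice the source would be a live operational system and the copy would be refreshed on a schedule.

```python
import sqlite3

# Hypothetical operational (source) system, held in memory for illustration.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders "
    "(id INTEGER, customer TEXT, card_token TEXT, amount REAL, placed_on TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ada", "tok_a1", 20.0, "2024-01-05"),
        (2, "Ben", "tok_b2", 35.5, "2024-01-05"),
        (3, "Ada", "tok_a3", 12.5, "2024-01-06"),
    ],
)

# Separate analysis store: copy only the columns that are useful for analytics
# (the payment token is left behind in the source system).
analysis = sqlite3.connect(":memory:")
analysis.execute(
    "CREATE TABLE order_facts (id INTEGER, customer TEXT, amount REAL, placed_on TEXT)"
)
rows = source.execute("SELECT id, customer, amount, placed_on FROM orders").fetchall()
analysis.executemany("INSERT INTO order_facts VALUES (?, ?, ?, ?)", rows)
analysis.commit()

# The analytical query runs against the copy, not against the source system.
totals = analysis.execute(
    "SELECT customer, SUM(amount) FROM order_facts GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The source system is read once and is otherwise left alone, and the analysis store only holds what the analyst needs.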
We have looked at where data can be stored. Now, how can we store it? There are two main methods: data files and databases.
To do this, we first consider file systems. Think of a well-organised physical filing system; a file system is the digital equivalent. We take information and put it in a folder, then perhaps put that folder inside another folder, just as we do when saving documents on our personal computers. We appreciate this because it can handle unstructured data (the content of the file is not a concern). However, how to access the data is not obvious: you need a media viewer to open a JPEG or PNG file. Furthermore, how do you analyse a photo, video or document without a structured system or an interface? An example of a file system is the Hadoop Distributed File System (HDFS), a Big Data manifestation of the file system. It efficiently stores very large amounts of information, with no consideration of data type, by using parallel processing on inexpensive infrastructure.
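The folder-inside-a-folder idea can be sketched with Python's standard `pathlib` and `tempfile` modules. The folder and file names below are hypothetical, and the "photo" is just a few placeholder bytes rather than a valid image. The point is that the file system stores both files the same way, without caring what is inside them.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # A folder inside another folder, as on a personal computer.
    nested = root / "projects" / "2024"
    nested.mkdir(parents=True)

    # The file system accepts any content: text and raw bytes are stored alike.
    (nested / "notes.txt").write_text("meeting notes", encoding="utf-8")
    (nested / "photo.jpg").write_bytes(b"\xff\xd8\xff\xe0")  # placeholder bytes, not a real image

    # List everything that was stored, relative to the top-level folder.
    stored = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )

print(stored)
```

Listing the files is easy, but nothing here says how to interpret `photo.jpg`. That is the access problem described above.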
- Delimited text file- This contains data representing two-dimensional tables with rows and columns. The data, stored as text, has breaks between the rows and columns, and these breaks are identified by specific formatting codes or characters called delimiters. Common delimiters are commas, tabs and pipes (“|” is a pipe). A comma-delimited file usually ends with “.CSV” (Comma-Separated Values), whereas tab- and pipe-delimited files often end with the “.TXT” (Text) file extension.
- Extensible Markup Language (XML) file- This file type was developed in the 90s. It has a flexible structure suited to encoding both documents and data. It was created primarily to facilitate the sharing of data over the internet, and it has also been used for web apps and messaging systems. It is common and allows for more complex structuring of data than a delimited text file, but it requires a more sophisticated interface to interpret the data and its structure for analysis.
- Log file- This is common in machine data and web analytics applications, and it is primarily used to record event data from a system. It may or may not follow a standard structure. Log files are very flexible, able to capture practically any data structure, but reading and using the data can be complicated: a parser is required to read and interpret the file.
- Metadata- This file type is unique to popular data analysis software. Alongside the data, it stores metadata that explains computations, actions, or characteristics of the data itself. Microsoft Excel spreadsheet files are the most popular type of this file. Data can also be stored in stand-alone files using tools like SAS, SPSS, and Tableau, all of which have their own file formats.
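As a rough illustration of how the first three file types differ when it comes to reading them, the sketch below loads the same kind of records from a pipe-delimited text, an XML document, and a log, using only Python's standard library. All sample data, field names, and the log line layout (`date time LEVEL message`) are assumptions made up for this example; real files, and real logs in particular, will vary.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# 1. Delimited text: rows and columns separated by a pipe delimiter.
pipe_text = "id|city|sales\n1|Lagos|120\n2|Accra|95\n"
rows = list(csv.DictReader(io.StringIO(pipe_text), delimiter="|"))

# 2. XML: the same records, but nested in tags, so we must walk the structure.
xml_text = (
    "<orders>"
    "<order id='1'><city>Lagos</city><sales>120</sales></order>"
    "<order id='2'><city>Accra</city><sales>95</sales></order>"
    "</orders>"
)
root = ET.fromstring(xml_text)
xml_rows = [
    {"id": order.get("id"), "city": order.findtext("city"), "sales": order.findtext("sales")}
    for order in root.findall("order")
]

# 3. Log file: no built-in table structure, so we write a small parser.
#    Assumed (hypothetical) line layout: "<date> <time> <LEVEL> <message>".
log_text = (
    "2024-01-05 10:00:01 INFO user=ada action=login\n"
    "2024-01-05 10:00:09 ERROR user=ben action=upload\n"
)
pattern = re.compile(r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)$")
events = [m.groupdict() for line in log_text.splitlines() if (m := pattern.match(line))]

print(rows)
print(xml_rows)
print(events)
```

The delimited file needs only a delimiter to be read. The XML needs code that knows which tags and attributes to look for. The log needs a hand-written pattern that matches its particular layout. That mirrors the increasing interpretation effort described in the list above.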