
text file formats and markup languages such as YAML

Introduction

  • general .txt files contain unstructured data
  • rich text files (.rtf) and MS Word document files (.doc) store formatting codes along with generally unstructured text data
  • however, data often needs to be stored in a structured manner so that applications can extract it accurately and meaningfully; this is where defined data formats and markup languages come in

Formatted text files

rich text format (*.rtf)

  • introduced by Microsoft in 1987 with MS Word for the Apple Macintosh
  • its syntax is built from groups, backslashes, control words and delimiters:
    • Groups are contained within curly braces ({}) and indicate which attributes should be applied to certain text.
    • The backslash (\) introduces a control word, which is a predefined RTF command.
      • A control word can be followed by a numeric parameter that sets its state.
      • For example, \f0 selects font 0 from the font table, and \b0 turns bold off.
    • The delimiter (a space, or any character other than a letter or digit) marks the end of a control word.
 {\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
 This is some {\b bold} text.\par
 }
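
As a rough illustration of the syntax described above, the following Python sketch pulls the control words and their optional numeric parameters out of the sample document with a regular expression. The pattern and the function name control_words are assumptions made for this example; it is not a complete RTF parser (it ignores control symbols, escaped characters and binary data).

```python
import re

# A control word is a backslash, lowercase letters, and an optional
# (possibly negative) numeric parameter. Simplified for illustration.
CONTROL_WORD = re.compile(r"\\([a-z]+)(-?\d+)?")

SAMPLE = r"""{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
This is some {\b bold} text.\par
}"""


def control_words(rtf):
    """Return (name, parameter) pairs; parameter is None when absent."""
    return [
        (name, int(param) if param else None)
        for name, param in CONTROL_WORD.findall(rtf)
    ]


if __name__ == "__main__":
    for name, param in control_words(SAMPLE):
        print(name, param)
```

For the sample, the first pair is ("rtf", 1) and the bold switch appears as ("b", None).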

Data structured document formats

Windows ini configuration files (*.ini)

  • these were commonly used before the introduction of the Windows Registry, which has generally superseded them, although they are still in use
  • a file contains sections for settings and preferences (each introduced by a name in square brackets), with each section containing name=value parameters
  • a comment line starts with a semicolon
  • an example is the system.ini file in Windows
; for 16-bit app support
[386Enh]
woafont=dosapp.fon
EGA80WOA.FON=EGA80WOA.FON
EGA40WOA.FON=EGA40WOA.FON
CGA80WOA.FON=CGA80WOA.FON
CGA40WOA.FON=CGA40WOA.FON

[drivers]
wave=mmdrv.dll
timer=timer.drv

[mci]
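
A minimal sketch of reading such a file from Python, assuming the standard library's configparser module is an acceptable stand-in for the Windows API (it reads a similar dialect but is not the same parser). The sample text is a shortened copy of the system.ini example above.

```python
import configparser

# Shortened copy of the sample; in practice this would be read from a file.
SAMPLE = """\
; for 16-bit app support
[386Enh]
woafont=dosapp.fon
EGA80WOA.FON=EGA80WOA.FON

[drivers]
wave=mmdrv.dll
timer=timer.drv

[mci]
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

# Section names keep their case; lines starting with ';' are skipped.
print(config.sections())
print(config["drivers"]["wave"])
# Note: configparser lower-cases option names by default.
print(config["386Enh"]["ega80woa.fon"])
```

Note that an empty section such as [mci] is still reported by sections().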

comma separated value (CSV) files

  • these are standard text files containing data with:
    • each line representing a row in a data table
    • values within a line separated by commas, with each value in sequence corresponding to a column of the data table
  • these files can be imported into MS Excel, where each value is placed in a separate cell in order
  • the file carries no indication of what each value means unless it begins with a header line naming the columns; with a lot of data it is hard for humans to read, as it is easy to lose track of which column a value belongs to
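
The value of a header line can be sketched with Python's standard csv module. The guest data below is hypothetical, and the quoted "Portland, OR" value is included only to show the usual convention for a value that itself contains a comma.

```python
import csv
import io

# Hypothetical guest list; the first line is the header row.
SAMPLE = """\
firstName,lastName,city
John,Doe,"Portland, OR"
María,García,Madrid
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE)
for row in rows:
    # Values are looked up by column name rather than by position.
    print(row["firstName"], row["lastName"], row["city"])
```

Without the header line, the same data would have to be addressed by position (row[0], row[1], ...), which is where the confusion described above tends to creep in.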

Markup languages

IBM's Generalized Markup Language (GML)

  • developed in the 1960s

Standard Generalized Markup Language (SGML)

  • created in 1986 as an ISO standard, being descended from IBM's Generalized Markup Language (GML)
  • originally designed to enable the sharing of machine-readable large-project documents in government, law, and industry.
  • it was extensively applied by the military, and the aerospace, technical reference, and industrial publishing industries
<!--
Copyright (c)  2002  your name, NewbieDoc project;
http://sourceforge.net/projects/newbiedoc
Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License,
Version 1.1 or any later version published by the Free Software
Foundation; with no Invariant Sections, with no Front-Cover
Texts, and with no Back-Cover Texts. A copy of the license can
be found at http://www.fsf.org/copyleft/fdl.html.
-->

<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V3.1//EN">

<article id="hello-world" lang="en">
   <sect1 id="introduction"><title>Hello world introduction</title>

      <para>
      Hello world!
      </para>

   </sect1>
</article>

HyperText Markup Language (HTML)

  • this was derived from SGML by Tim Berners-Lee at CERN in the early 1990s to form the code behind web pages, which instructs a web browser about content and formatting
  • it is a hierarchical format of nested elements, with each element's content enclosed between a start tag and an end tag written in angle brackets
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html> 
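
The nesting of elements can be made visible with a small Python sketch built on the standard library's html.parser. The class name OutlineParser is made up for this example, and the depth counting assumes every element has an explicit end tag, as in the sample above.

```python
from html.parser import HTMLParser

SAMPLE = """<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>
"""


class OutlineParser(HTMLParser):
    """Record each start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-nested markup with explicit end tags.
        self.depth -= 1


parser = OutlineParser()
parser.feed(SAMPLE)
parser.close()
for depth, tag in parser.outline:
    print("  " * depth + tag)
```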

Extensible Markup Language (XML)

  • The XML Working Group was formed in 1996 and XML 1.0 was released as a W3C Recommendation in 1998. As with HTML, XML was derived from the Standard Generalized Markup Language (SGML); whereas HTML describes web pages, XML was developed as a general data serialization tool.
  • the XML format became extremely popular as a way of not only storing configuration settings but also storing and transferring data
  • like HTML, it is a hierarchical format of nested elements, with each value enclosed between a start tag and an end tag written in angle brackets; for readability, child elements are usually indented from their parents (although this is not required for the file to be read by machine)
  • a comment is written as <!-- no-op -->, as demonstrated below
<?xml version="1.0"?>
<scpd xmlns="urn:schemas-upnp-org:service-1-0">
  <specVersion>
    <major>1</major>
    <minor>0</minor>
  </specVersion>
  <actionList>
    <action>
      <name>MagicOn</name>
    </action>
  </actionList>
</scpd>

<!-- no-op -->
<guests>
  <guest>
    <firstName>John</firstName> <lastName>Doe</lastName>
  </guest>
  <guest>
    <firstName>María</firstName> <lastName>García</lastName>
  </guest>
  <guest>
    <firstName>Nikki</firstName> <lastName>Wolf</lastName>
  </guest>
</guests>
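
A minimal sketch of extracting the guest names from the second sample, assuming Python's standard xml.etree.ElementTree module. The comment before the root element is simply ignored by the default parser.

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<!-- no-op -->
<guests>
  <guest>
    <firstName>John</firstName> <lastName>Doe</lastName>
  </guest>
  <guest>
    <firstName>María</firstName> <lastName>García</lastName>
  </guest>
  <guest>
    <firstName>Nikki</firstName> <lastName>Wolf</lastName>
  </guest>
</guests>
"""

root = ET.fromstring(SAMPLE)

# Each <guest> child holds one <firstName> and one <lastName> element.
guests = [
    (guest.findtext("firstName"), guest.findtext("lastName"))
    for guest in root.findall("guest")
]

for first, last in guests:
    print(first, last)
```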

JavaScript Object Notation (JSON)

  • an open standard file format developed in 2001
  • has advantages over XML:
    • smaller file size as there are no end tags and thus faster data transmission
    • more readable by humans, hence easier to write and less prone to human error
  • however, XML is more flexible: with schemas it supports richer data types such as binary data and timestamps, for which JSON has no native type
  • unlike most markup languages, JSON does NOT allow comments!
  • strings must be enclosed in double quotes (single quotes are not valid JSON)
{"guests":[
  { "firstName":"John", "lastName":"Doe" },
  { "firstName":"María", "lastName":"García" },
  { "firstName":"Nikki", "lastName":"Wolf" }
]}
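
The same guest list can be read with Python's standard json module, as sketched below. The helper is_valid_json is a hypothetical name for this example; it shows that a strict parser rejects single-quoted strings and comments.

```python
import json

SAMPLE = """\
{"guests":[
  { "firstName":"John", "lastName":"Doe" },
  { "firstName":"María", "lastName":"García" },
  { "firstName":"Nikki", "lastName":"Wolf" }
]}
"""

# JSON objects become dicts and JSON arrays become lists.
data = json.loads(SAMPLE)
last_names = [guest["lastName"] for guest in data["guests"]]
print(last_names)


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Single-quoted strings and trailing comments are both rejected.
print(is_valid_json("{'firstName': 'John'}"))
print(is_valid_json('{"a": 1} // comment'))
```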

YAML Ain't Markup Language (YAML)

  • introduced in 2001 to be even more human readable than JSON; like JSON and XML, it is designed to be a data serialization language
  • now very commonly used for configuration files, including in Python projects and thus in artificial intelligence tooling
  • utilises white space and indentation for structure (somewhat similar to Python's use of indentation)
  • three dashes indicate the start of a new YAML document (as it supports multiple documents in one file)
  • nesting of data is done via indentation with spaces (two per level by convention) - CANNOT use tabs to indent! - as different tools handle tabs differently
  • whitespace: unless otherwise indicated, newlines indicate the end of a field
  • lines are commented by starting with # (as in Python)
  • the values in YAML's key-value pairs are often scalars (strings, numbers, booleans, null), and the key is usually a string. Scalars act like the scalar types in languages like Perl, JavaScript, and Python.
  • YAML strings are Unicode. In most situations you don't have to put them in quotes; quoting only matters when they contain characters that could be mistaken for YAML syntax, such as & at the start of a value or a colon followed by a space
  • Mappings associate keys with values and are unordered. Maps can be nested by increasing the indentation, or a new map can be started at the same level once the previous one has ended
  • Sequences in YAML are represented by using the hyphen (-) and space. They are ordered and can be embedded inside a map using indentation.
  • you can embed JSON in a YAML document (YAML 1.2 is designed as a superset of JSON)
---
# <- yaml supports comments, json does not
# did you know you can embed json in yaml?
# try uncommenting the next line
# { foo: 'bar' }

json:
  - rigid
  - better for data interchange
yaml: 
  - slim and flexible
  - better for configuration
object:
  key: value
  array:
    - null_value:
    - boolean: true
    - integer: 1
    - alias: &example aliases are like variables
    - alias: *example
paragraph: >
   Blank lines denote

   paragraph breaks
content: |-
   Or we
   can auto
   convert line breaks
   to save space
alias: &foo
  bar: baz
alias_reuse: *foo 
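
Because tabs are not permitted for indentation, a quick check before handing a file to a YAML parser can save a confusing error message. The sketch below uses only the Python standard library; the function name tab_indented_lines is made up for this example, and it is a simple line scan rather than a YAML validator.

```python
def tab_indented_lines(text):
    """Return 1-based numbers of lines whose leading whitespace has a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(number)
    return bad


# Line 2 is indented with a tab, the others with spaces.
SAMPLE = "object:\n\tkey: value\n  array:\n    - boolean: true\n"
print(tab_indented_lines(SAMPLE))
```

A tab inside a value (after the first non-whitespace character) is not reported, since only indentation is affected by the rule.
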
it/textfiles.txt · Last modified: 2023/08/29 07:44 by gary1
