CUNY-CoLAG Language Domain Downloads

This page describes the file formats of two downloadable versions of the 2011 CoLAG domain. A more detailed description of the domain can be found in Sakas & Fodor (2012). A technical report that gives a fully specified description of the multi-language ‘supergrammar’ that generated the domain is available here. A description of the original version of the domain (circa 2003) and results from early investigations can be found in Sakas (2003), in Fodor and Sakas (2004), and elsewhere.

Any questions or difficulties concerning the downloadable files should be directed to William Sakas: 

The Formats

The CoLAG domain is downloadable in two formats:

Format 1) as a single (large) 'flat' file. Download: (~56 MB)
Format 2) as four separate (smaller) files. Download: (~13 MB)

The information contained in each format is identical. Format 1) is intended for batch (beginning to end) processing and Format 2) is suitable for importing into a relational database system for interactive querying.

Format 1) contains much redundancy. For example, the CoLAG sentence S Verb O1 (subject verb direct-object) exists in over 400 CoLAG languages and is repeated for each language in which it occurs, albeit often with a different tree structure. Nevertheless, Format 1) is easy to handle if your preferred programming style is to process the data from beginning to end, collecting summary data for later analysis. Format 2) has most of the redundancy removed and can be imported into a relational database system for interactive querying (though we have also found that a single non-relational SQL table created by importing the large flat file works quite well, provided that the proper indices are generated after the data has been imported). The file formats are specified below.

The Sentences


Sentences in CoLAG consist of sequences of non-null lexical items (e.g., S, O1, Adv, Aux, Verb, etc.) and non-null features (e.g., DEC, Q, WH, etc.). Some example sentences are:


            S Aux[+FIN] Verb Adv [ILLOC DEC]

            Verb[+FIN] S P O3 [ILLOC Q]                           (P is preposition, O3 is object of preposition)

            O1[+WH] S O2 Verb[+FIN] ka [ILLOC Q]        (ka is a question marker)

            Verb[-FIN] O1 [ILLOC IMP]


Note that in the downloadable files, the illocutionary force feature (e.g., [ILLOC Q]) is kept in its own column, separate from the rest of the sentence. Also, for readability, the finiteness feature ([+FIN]) is not shown in the sentences in these files. If needed, it can easily be generated for a sentence as follows:


if [ILLOC DEC] or [ILLOC Q]:     # a declarative or a question
    if there is an Aux in the sentence: [+FIN] is attached to the Aux
    else: [+FIN] is attached to the Verb
# Note: the Verb in an imperative does not receive the [+FIN] feature
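
The rule above can be sketched in Python roughly as follows. This is a minimal sketch rather than part of the CoLAG distribution: the function name add_finiteness is hypothetical, and it assumes that tokens in a sentence are separated by spaces and that the auxiliary and the main verb appear exactly as Aux and Verb. It adds only [+FIN] and makes no attempt to mark imperatives.

```python
def add_finiteness(illoc, sent):
    """Return the sentence with [+FIN] attached to the Aux or the Verb.

    illoc is the illocutionary force (DEC, Q or IMP); sent is the sentence
    string without finiteness features. Assumes space-separated tokens.
    """
    if illoc not in ("DEC", "Q"):
        # Imperatives: the Verb does not receive [+FIN]; leave unchanged.
        return sent
    tokens = sent.split()
    # [+FIN] goes on the Aux if there is one, otherwise on the Verb.
    target = "Aux" if "Aux" in tokens else "Verb"
    return " ".join(t + "[+FIN]" if t == target else t for t in tokens)


print(add_finiteness("DEC", "S Aux Verb Adv"))
print(add_finiteness("Q", "Verb S P O3"))
```

The two calls correspond to the first two example sentences shown earlier in this section.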

All features (including [+FIN] and surface-null features, e.g., SLASH) appear in the bracketed tree structures of the sentences described below.


The Grammars


CoLAG grammars are formulated in a principles and parameters framework. In both formats they are represented in the downloadable files as a string of 13 zeros and/or ones, each of which gives the value of one of the thirteen binary parameters that distinguish the grammars in the domain. Individual grammars all combine a universal component (UG) with their relevant parameter values (see the Supergrammar).

The parameters and their values are listed in the table below. The value of the first parameter (P1) is the leftmost character in the string and the value of the last parameter (P13) is the rightmost. For example, the grammar 0001001100011 licenses a language that is subject-initial, head-initial and complementizer-initial, and that has optional topics, no null subjects, no null topics, wh-movement, preposition stranding, no topic marking, no V to I movement, no I to C movement, affix hopping and question inversion.

It is important to note that the use of 0's and 1's to designate parameter values is for notational compactness. Actual parameter values take the form of 'treelets' (small fragments of tree structure); see the discussion in Fodor (1998a) and in Sakas & Fodor (2012), around Figure 2.

Note also that the choice of which linguistic phenomena are coded here as 0 and which as 1 is somewhat arbitrary; 0 does not necessarily imply a default value. On the role of default values in learning, see the discussion in Sakas & Fodor (2012).





Parameter  Name                                Value 0                 Value 1
P1         Subject Position                    Subject Initial         Subject Final
P2         Headedness in IP, NegP, VP, PP      Head Initial            Head Final
P3         Headedness in CP                    Complementizer Initial  Complementizer Final
P4         Optional (versus obligatory) Topic  Obligatory Topic        Optional Topic
P5         Null Subject                        No Null Subject         Optional Null Subject
P6         Null Topic                          No Null Topic           Optional Null Topic
P7         Wh-Movement                         No Wh-Movement          Obligatory Wh-Movement
P8         Preposition Stranding               Pied Piping             Preposition Stranding
P9         Topic Marking                       No Topic Marking        Topic Marking
P10        VtoI Movement                       No VtoI Movement        Obligatory VtoI Movement
P11        ItoC Movement                       No ItoC Movement        Obligatory ItoC Movement
P12        Affix Hopping                       No Affix Hopping        Affix Hopping
P13        Q-Inversion (ItoC in questions)     No Question Inversion   Obligatory Question Inversion

These conventional names for the parameters are used for convenient reference, but please bear in mind that the actual linguistic consequences of these parameters are not fully self-evident because they depend to various extents on how they interact with each other and with the ‘universal grammar’ (the UG) of the CoLAG domain.
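
To make the left-to-right reading of a grammar string concrete, here is a small Python sketch that maps each character to the corresponding value label from the table above. The names PARAMETERS and describe_grammar are hypothetical, and the name used for the seventh parameter (Wh-Movement) is inferred from its two values. The labels are the conventional names only and carry the caveat just mentioned.

```python
# (name, value coded as 0, value coded as 1) for P1..P13, taken from the table.
# Assumption: the name "Wh-Movement" for P7 is inferred from its two values.
PARAMETERS = [
    ("Subject Position", "Subject Initial", "Subject Final"),
    ("Headedness in IP, NegP, VP, PP", "Head Initial", "Head Final"),
    ("Headedness in CP", "Complementizer Initial", "Complementizer Final"),
    ("Optional (versus obligatory) Topic", "Obligatory Topic", "Optional Topic"),
    ("Null Subject", "No Null Subject", "Optional Null Subject"),
    ("Null Topic", "No Null Topic", "Optional Null Topic"),
    ("Wh-Movement", "No Wh-Movement", "Obligatory Wh-Movement"),
    ("Preposition Stranding", "Pied Piping", "Preposition Stranding"),
    ("Topic Marking", "No Topic Marking", "Topic Marking"),
    ("VtoI Movement", "No VtoI Movement", "Obligatory VtoI Movement"),
    ("ItoC Movement", "No ItoC Movement", "Obligatory ItoC Movement"),
    ("Affix Hopping", "No Affix Hopping", "Affix Hopping"),
    ("Q-Inversion (ItoC in questions)", "No Question Inversion",
     "Obligatory Question Inversion"),
]


def describe_grammar(gramm):
    """Return the 13 value labels for a grammar string, P1 (leftmost) first."""
    if len(gramm) != len(PARAMETERS) or set(gramm) - {"0", "1"}:
        raise ValueError(f"expected 13 characters, each 0 or 1: {gramm!r}")
    return [PARAMETERS[i][1 + int(bit)] for i, bit in enumerate(gramm)]


for number, label in enumerate(describe_grammar("0001001100011"), start=1):
    print(f"P{number}: {label}")
```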

The Trees


Syntactic tree structures are identical in both Format 1) and Format 2). CoLAG uses bracketed tree notation, which deviates from standard bracketed notation in that parentheses rather than square brackets are used to demarcate constituents. Square brackets are used to demarcate features. Terminals are surrounded by double quotes, though some terminals or features (e.g., [+NULL], [SLASH S]) may not be realized in the surface sentence.  For example, one bracketed tree structure (of several) for the sentence S Aux Verb O1 O2 [ILLOC DEC] in CoLAG is:




Flat file (Format 1)

In the large flat file, each row contains the data for a single sentence as generated with a particular tree structure by a particular grammar (an ambiguous sentence appears on more than one row, with a different tree structure on each). The file is tab-delimited. Lines are terminated in MS Windows style, i.e., carriage return/line feed (ASCII codes 13 and 10 respectively). There are 7 columns. There is no header line; the column headers/field names we use are shown below after each column number. Though this is not a requirement, we encourage users of the domain to use the same names.

Columns 1-4 contain the linguistic data. Columns 5, 6 and 7 are integer identifiers (IDs) for the grammars, sentences and tree structures. These are included for efficiency: most programming languages compare and manipulate numbers faster than strings, and for certain queries the string representation is unneeded (e.g., Which CoLAG languages are subsets of other CoLAG languages?).
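
As a hedged illustration of a query that needs only the IDs, the Python sketch below looks for proper-subset relations between languages. It assumes that a language can be treated as the set of sentence IDs licensed by a grammar (tree structures are ignored), and the function names are hypothetical. The pairwise comparison is quadratic in the number of grammars, which is a simplicity trade-off rather than a recommendation.

```python
from collections import defaultdict


def languages_by_grammar(id_pairs):
    """Map each grammar ID to the set of sentence IDs it licenses.

    id_pairs is any iterable of (grammar ID, sentence ID) integer pairs.
    """
    langs = defaultdict(set)
    for gramm_id, sent_id in id_pairs:
        langs[gramm_id].add(sent_id)
    return langs


def proper_subsets(langs):
    """Return sorted (subset ID, superset ID) pairs of grammar IDs."""
    # set < set is Python's proper-subset test, so a language is never
    # paired with itself or with an identical language.
    return sorted((a, b) for a in langs for b in langs if langs[a] < langs[b])


# Invented toy data, not taken from CoLAG.
toy = [(1, 10), (1, 11), (2, 10), (2, 11), (2, 12), (3, 13)]
print(proper_subsets(languages_by_grammar(toy)))
```

In the toy data, the language of grammar 1 is properly contained in that of grammar 2, and grammar 3 is unrelated to both.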

Column 1 (gramm): A principles and parameters grammar consisting of a string of thirteen zeros and/or ones, e.g., 0001001100011.
Column 2 (illoc): The illocutionary force of the sentence: one of Q, DEC or IMP (question, declarative or imperative).
Column 3 (sent): The overt tokens and features that make up the sentence, e.g., S Aux Verb O1 O2.
Column 4 (struct): A bracketed representation of one tree structure for the sentence, given the grammar.
Column 5 (grammID): An integer ID for the grammar in Column 1. This is simply the decimal (base 10) value of the grammar’s binary (base 2) representation. For example, the grammar 0001001100011 has the grammID 611.
Column 6 (sentID): An arbitrarily assigned integer ID for the combination of the illocutionary force in Column 2 and the sentence in Column 3. For example, Aux Verb O1 [ILLOC DEC] would have a different ID than Aux Verb O1 [ILLOC Q].
Column 7 (structID): An arbitrarily assigned integer ID for the structure in Column 4.

Columns 1-4 contain text data. The maximum character widths of the columns are:

Column 1: 13 characters
Column 2: 3 characters
Column 3: 50 characters
Column 4: 550 characters
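
For beginning-to-end processing, the flat file can be read with a few lines of Python. The following is a sketch under stated assumptions: the function name read_flat is hypothetical, the sample row uses a made-up placeholder in the struct column rather than a real CoLAG tree, and quoting is disabled on the assumption that the double quotes around terminals in the tree structures should be kept as ordinary characters. The sketch also checks that grammID equals the base 2 value of gramm, as described for Column 5.

```python
import csv

FIELDS = ["gramm", "illoc", "sent", "struct", "grammID", "sentID", "structID"]


def read_flat(lines):
    """Yield one dict per row from an iterable of tab-delimited lines."""
    # QUOTE_NONE keeps double quotes in the struct column as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            raise ValueError(f"expected 7 columns, got {len(row)}")
        rec = dict(zip(FIELDS, row))
        for key in ("grammID", "sentID", "structID"):
            rec[key] = int(rec[key])
        # grammID is the decimal value of the 13-character binary string.
        if rec["grammID"] != int(rec["gramm"], 2):
            raise ValueError(f"grammID does not match gramm: {row!r}")
        yield rec


# Made-up sample row; the struct value is a placeholder, not a CoLAG tree.
sample = ['0001001100011\tDEC\tS Aux Verb O1 O2\t(X "S")\t611\t7\t42\r\n']
for rec in read_flat(sample):
    print(rec["grammID"], rec["illoc"], rec["sent"])
```

To read the downloaded file itself, pass an open file object instead of the list, opened with newline="" so that the csv module handles the carriage return/line feed endings.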

Relational database (Format 2)

The description of this format makes reference to the columns and column headers outlined in Format 1) immediately above.

This format stores grammars, sentences and structures in three separate files, together with a fourth file that relates the three data files to one another. Columns are tab-delimited and there are no header lines.

The grammars file, COLAG_2011_gramms.txt, contains two columns: grammID and gramm.
The sentences file,  COLAG_2011_sents.txt, contains three columns: sentID, illoc and sent.
The tree structures file, COLAG_2011_structs.txt, contains two columns: structID and struct.

The IDs file, COLAG_2011_IDs.txt, contains three columns: grammID, sentID and structID.

The grammars file, the sentences file and the tree structures file contain the relevant linguistic information with redundancies removed: in each file there is one row for each unique element (grammar, sentence, or tree structure) in CoLAG. The IDs file "ties" the grammars, sentences and structures together (i.e., it records which sentences and corresponding tree structures are licensed by which CoLAG grammars). The IDs file has exactly the same number of rows as the flat file described in Format 1), but without the linguistic information. Database "joins" or "views" can be used to extract the linguistic information when required.
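
One possible way to set this up is sketched below using Python's built-in sqlite3 module. This is an illustration under assumptions, not a prescribed schema: the table names (gramms, sents, structs, ids), the view name (flat) and the function names are hypothetical, while the column names follow the file descriptions above. The demonstration rows are invented, and the struct value is a placeholder rather than a real CoLAG tree.

```python
import csv
import sqlite3

# Hypothetical table names; column names follow the four file descriptions.
TABLES = {
    "gramms": ["grammID", "gramm"],
    "sents": ["sentID", "illoc", "sent"],
    "structs": ["structID", "struct"],
    "ids": ["grammID", "sentID", "structID"],
}


def create_schema(conn):
    for table, cols in TABLES.items():
        defs = ", ".join(
            f"{c} {'INTEGER' if c.endswith('ID') else 'TEXT'}" for c in cols)
        conn.execute(f"CREATE TABLE {table} ({defs})")


def load(conn, table, lines):
    """Insert tab-delimited lines (no header) into the named table."""
    cols = TABLES[table]
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = ([int(v) if c.endswith("ID") else v for c, v in zip(cols, row)]
            for row in reader)
    marks = ", ".join("?" for _ in cols)
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)


def finish(conn):
    """Create indices after the import, then a view that joins the tables."""
    conn.execute("CREATE INDEX ids_gramm ON ids (grammID)")
    conn.execute("CREATE INDEX ids_sent ON ids (sentID)")
    conn.execute("""
        CREATE VIEW flat AS
        SELECT g.gramm, s.illoc, s.sent, t.struct,
               i.grammID, i.sentID, i.structID
        FROM ids i
        JOIN gramms g ON g.grammID = i.grammID
        JOIN sents s ON s.sentID = i.sentID
        JOIN structs t ON t.structID = i.structID""")


# Invented one-row demonstration in an in-memory database.
conn = sqlite3.connect(":memory:")
create_schema(conn)
load(conn, "gramms", ["611\t0001001100011\r\n"])
load(conn, "sents", ["7\tDEC\tS Aux Verb O1 O2\r\n"])
load(conn, "structs", ['42\t(X "S")\r\n'])
load(conn, "ids", ["611\t7\t42\r\n"])
finish(conn)
print(conn.execute("SELECT gramm, illoc, sent FROM flat").fetchall())
```

With the real data, each call to load would be given an open file object for the corresponding file (for example COLAG_2011_gramms.txt), opened with newline="". The view reproduces the seven columns of the flat file in the same order.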