Policy Browser - By Cybera
The Policy Browser allows you to search and easily view all documents submitted to the CRTC during the consultation process. Currently, the only process loaded and displayed in the Policy Browser is for public process 2015-134.
The browser tab of the website lets you search the documents submitted to the CRTC in the several ways outlined below. Note that some submissions contain artifacts left over from converting .pdf or .docx files to raw text, and these artifacts appear in the searchable text.
The “Timeline” tab allows you to view the documents that were submitted in reverse chronological order as they were submitted to the CRTC. In order to view the content of each submission, simply click on the blue link of the submission, and its content should be displayed to the right of the timeline tab. It is also possible to expand/collapse the file lists by clicking the date header.
Under the organizations tab, the documents are grouped by the organization that submitted them to the CRTC. The organizations are listed alphabetically. As in the timeline view, click a document name to view the submission, or click an organization name to collapse its entries.
Under the queries tab you can view the results of saved queries from both Solr searches and doc2vec results. Click the exact search term that was used, and the results are displayed on the right. The number of text segments returned by the search appears below the search field.
Under the questions tab you can view (some of) the questions asked by the CRTC as well as some additional sub-questions that we had and looked into.
The Summary Button
To begin, each question has a blue “Summary” button which can be used to display important statistics about the text pulled via various queries meant to help answer the question.
There are currently four major summary components.
- The number of organizations with segments matching queries associated with the question out of all of the known organizations in the database.
- Categories matched: A summary table of coverage per category of organization. This shows the number of segments for a given category, along with the maximum quality value of any of those segments.
- Queries used: Every query associated with the question, along with how many segments match it, how many categories of organization those segments represent, how many organizations they represent, and the maximum quality of the segments.
- Missing organizations: A list of all the organizational groups that are known to have submitted a document to the CRTC whose document(s) did not appear to contain relevant text to the question at hand.
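The four components above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Policy Browser's own implementation: the sample rows, the organization names, and the `summarize` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows; real rows would come from the Policy Browser data.
SEGMENTS = [
    {"organization": "Org A", "category": "Government", "query": "q1", "quality": 0.8},
    {"organization": "Org A", "category": "Government", "query": "q2", "quality": 0.4},
    {"organization": "Org B", "category": "Advocacy", "query": "q1", "quality": 0.6},
]
KNOWN_ORGANIZATIONS = {"Org A", "Org B", "Org C"}


def summarize(segments, known_organizations):
    """Build the four summary components from a list of segment rows."""
    matched = {row["organization"] for row in segments}

    # Categories matched: segment count and maximum quality per category.
    categories = defaultdict(lambda: {"segments": 0, "max_quality": 0.0})
    # Queries used: segment count, distinct categories and organizations,
    # and maximum quality per query.
    queries = defaultdict(
        lambda: {"segments": 0, "categories": set(),
                 "organizations": set(), "max_quality": 0.0}
    )
    for row in segments:
        cat = categories[row["category"]]
        cat["segments"] += 1
        cat["max_quality"] = max(cat["max_quality"], row["quality"])

        qry = queries[row["query"]]
        qry["segments"] += 1
        qry["categories"].add(row["category"])
        qry["organizations"].add(row["organization"])
        qry["max_quality"] = max(qry["max_quality"], row["quality"])

    return {
        "organizations_matched": (len(matched), len(known_organizations)),
        "categories": dict(categories),
        "queries": dict(queries),
        "missing_organizations": sorted(known_organizations - matched),
    }


summary = summarize(SEGMENTS, KNOWN_ORGANIZATIONS)
print(summary["organizations_matched"])
print(summary["missing_organizations"])
```

In this sketch, an organization counts as "missing" simply because none of its rows matched any query associated with the question.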
What is "Quality"?
A logged-in administrator can assign every query a quality score from 0.0 to 1.0. Currently, through the interface, it is only possible to assign a quality score of 0.2, 0.4, 0.6, 0.8, or 1.0. Via the database or future refinements, a continuous scale of quality scores could be assigned.
The quality score assigned to a query is purely subjective. It is meant to deal with the following problem when collecting segments of text for further analysis on a question: specific queries may return very high quality data but have very poor coverage. If you want both high quality data and great coverage, you will likely want several very targeted queries, which may not have great coverage, combined with some more general queries with greater coverage. The more targeted queries should be given a higher quality score, which allows someone doing downstream analysis to do things like select the top 5 highest quality segments per organization or category by simply filtering on a column of a data frame.
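As a rough illustration of that kind of downstream selection, the sketch below keeps the highest-quality segments per organization using only the Python standard library. The rows, the `top_segments` helper, and the query names are hypothetical, and a data frame library would work just as well.

```python
import heapq
from collections import defaultdict


def top_segments(rows, group_key="organization", n=5):
    """Return the n highest-quality rows for each value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        name: heapq.nlargest(n, members, key=lambda r: r["quality"])
        for name, members in groups.items()
    }


# Hypothetical rows: a targeted query scored 0.8 and a general query scored 0.4.
rows = [
    {"organization": "Org A", "query": "targeted", "quality": 0.8, "segment": "a1"},
    {"organization": "Org A", "query": "general", "quality": 0.4, "segment": "a2"},
    {"organization": "Org A", "query": "general", "quality": 0.4, "segment": "a3"},
    {"organization": "Org B", "query": "general", "quality": 0.4, "segment": "b1"},
]

best = top_segments(rows, n=2)
for organization, members in best.items():
    print(organization, [m["segment"] for m in members])
```

Passing `group_key="category"` would give the per-category version of the same selection, provided the rows carry a category value.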
The Segments Button
Clicking this button under a particular question displays the segments of text that have been located by machine as relevant to that question. By default you’ll see an expanded view, where all results are displayed at once. You can expand or collapse these results by clicking either “Categories” or “Organizations”. The results are grouped by both category and organization, so you can browse answers either by organizational category or by individual organization.
To cut down on the amount of information in this view, only the top 5 results (ordered by descending quality score) are shown per category or organization. If a category or organization has fewer than 5 results in total, all of them are shown. The intent of this view is to provide an overview of the kind of segments being collected for a category or organization. If you wish to look at all of the segments, you can download the full CSV for further analysis.
The CSV Button
This button allows you to download these categorized text segments for your own analysis as a CSV. Simply click the link to download the file, and the file is ready for import into Excel, R, Python, etc. to be manipulated and analysed as you see fit. The CSV files will have the following headings:
- document: the name of the document the segment is from
- segment: the actual text string pulled from the document
- query: the search string used to find the segment
- category: organizational category of the document submitter
- organization: the name of the organization submitting the text
- quality: subjective measure of quality of the text segment, derived from the quality score assigned to the query it matches (see above)
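For readers working in Python, a minimal sketch of loading such a file with the standard csv module follows. The sample contents are invented for illustration and stand in for a real download. The column names follow the headings listed above, while the file's encoding and exact quoting are assumptions.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a downloaded file. With a real download you would
# open it instead, e.g. open("segments.csv", newline="", encoding="utf-8").
SAMPLE_CSV = io.StringIO(
    "document,segment,query,category,organization,quality\n"
    'doc1.pdf,"Access should be affordable, and equal.",'
    '"content:""affordable access""~5",Advocacy,Org A,0.8\n'
    'doc2.docx,Speeds greater than 5 mbps.,'
    '"content:(""greater than"" && ""mbps"")",Government,Org B,0.4\n'
)


def load_segments(handle):
    """Read segment rows and convert the quality column to a float."""
    rows = []
    for row in csv.DictReader(handle):
        row["quality"] = float(row["quality"])
        rows.append(row)
    return rows


rows = load_segments(SAMPLE_CSV)

# Count segments per category, then keep only the higher-quality segments.
per_category = Counter(row["category"] for row in rows)
high_quality = [row for row in rows if row["quality"] >= 0.6]

print(per_category)
print([row["document"] for row in high_quality])
```

Converting quality to a number at load time makes later filtering and sorting on that column straightforward.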
Under this tab of the browser you can search the CRTC documents for keywords or terms you supply yourself, using Solr. This allows you to perform your own fuzzy text searches through the documents. As a basic example of a Solr search through the data, to look for the term “affordable access” you would type the following in the search bar:
content:"affordable access"
This search looks for every occurrence of the word pair “affordable access” (case insensitive) within the documents. However, sometimes those words won’t appear next to each other in the text. For example, this search would not find phrases like “affordable and equal access”, because the search terms are not adjacent. For more flexibility, we can modify the query to allow some space between the search terms as follows:
content:"affordable access"~5
Here the “~5” tells Solr that the search terms can be separated by up to five words in the text.
Of course, there are more complex queries you could apply. For example, another important keyword is OR which allows you to search for multiple terms at once. For example, suppose we wanted to search for “should be defined” and “should not be defined” simultaneously. To do that we make use of the OR keyword as follows:
content:("should be defined" OR "should not be defined")
Here we are looking for exact matches to either of those strings, rather than relying on allowing space between the terms. Note the addition of parentheses around the search terms.
One final basic term of note is the AND clause, which requires two separate strings to both be present. For example, suppose we wanted to search for text relevant to broadband speeds. A logical query would then include both “greater than” and “mbps”. This can be done with the following query, where && is Solr's shorthand for AND:
content:("greater than" && "mbps")
This asks Solr to find text containing both of those terms.
Expanding on the above example it is possible to combine keywords directly. Perhaps we want to search for both “greater than” and “less than” simultaneously in the above query, rather than making two searches separately. These can be combined using the OR clause as follows:
content:(("greater than" OR "less than") && "mbps")
Note again the additional parentheses. In this case we are looking for either “greater than” or “less than”, together with “mbps”.
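If you build such queries programmatically, for example to keep a list of saved searches, small helper functions can keep the quoting and parentheses consistent. The sketch below is illustrative only: the helper names are hypothetical, and it produces nothing more than the query strings discussed above.

```python
def phrase(text, slop=None):
    """Quote a phrase; slop adds a proximity suffix such as ~5."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    quoted = f'"{escaped}"'
    return quoted if slop is None else f"{quoted}~{slop}"


def any_of(*clauses):
    """Join clauses with OR inside parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with && (AND) inside parentheses."""
    return "(" + " && ".join(clauses) + ")"


def on_field(field, clause):
    """Restrict a clause to a single field, e.g. content."""
    return f"{field}:{clause}"


proximity = on_field("content", phrase("affordable access", slop=5))
either = on_field(
    "content", any_of(phrase("should be defined"), phrase("should not be defined"))
)
combined = on_field(
    "content",
    all_of(any_of(phrase("greater than"), phrase("less than")), phrase("mbps")),
)

print(proximity)
print(either)
print(combined)
```

Nesting `any_of` inside `all_of` mirrors the nested parentheses in the combined query, which is what keeps the OR grouped before the AND is applied.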
This basic functionality should be enough to get you started with the Solr search function. We strongly encourage you to look at the Solr documentation as well, as there are many more options and combinations of search terms that will allow you to refine your Solr queries to be more efficient and to find segments of text that are more relevant.
Most queries will likely be on the content field, as this contains the actual unstructured text (which is a primary driver for using Solr in the first place!). However, you are not limited to using only this field. Here is the full list of fields that have been imported into Solr that you may use:
- id: The Solr ID that can reference the document. Solr's default ID scheme has been overridden to use the sha256 hashes of document content that are used in the Neo4J graph database.
- sha256: The sha256 hash of the document contents. Even if the name of the document changes, this should remain the same, as long as the content in the document doesn't change.
- case: The case number from the CRTC site.
- ppn: The public process number from the CRTC site (right now, we only have documents for ppn 2015-134).
- dmid: The document management ID from the CRTC site. This should be unique per document and could be used to get back to the original document that was scraped.
- label: The Neo4J label associated with the node. Right now, everything in Solr will have the label "Document", but other types of nodes with unstructured text could be imported and differentiated here.
- name: The name of the document, either from the document scraper, or from its internal zipfile entry.
- submission_name: The name of the submission set the document was a part of (if it exists).
- type: The type of document. Generally, this is going to be "pdf", "doc", "docx", "html", "xls", etc., representing the original form of the document. There is one special type: "subdoc". This identifies Documents that weren't separate in the original scrape, but were derived from much larger documents. Some of the documents submitted were really collections of individual responses, so it made sense to split these up.
- content: The unstructured text contained within the document. This is pulled from the content field of the related Neo4J node.
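The other fields can be combined with a content search. The sketch below builds a request URL using Solr's standard q, fq, fl, rows and wt parameters. It is a sketch under stated assumptions: the host, port, and core name are hypothetical, how the ppn and type values are indexed is assumed, and the Policy Browser's own search box may not expose these parameters directly.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/policy/select"


def build_select_url(text_query, filters=(), fields=(), rows=10):
    """Build a select URL with a main query, filter queries and a field list."""
    params = [("q", text_query)]
    # Each fq narrows the result set without affecting relevance scoring.
    params += [("fq", f) for f in filters]
    if fields:
        # fl limits which stored fields are returned for each document.
        params.append(("fl", ",".join(fields)))
    params.append(("rows", str(rows)))
    params.append(("wt", "json"))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_select_url(
    'content:"affordable access"~5',
    filters=['ppn:"2015-134"', "type:subdoc"],
    fields=["id", "name", "submission_name", "type"],
)

# Decode the query string again to confirm what would be sent.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["fq"])
```

Here the text search stays on the content field, while the ppn and type fields restrict the results to sub-documents from a single public process.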
A link to the GitHub repository containing source code and some examples of potential analysis can be found here. This source code also contains the web scrapers and post-processing scripts that will allow you to create your own neo4j database and file browser for another CRTC consultation process.
This project was funded through a grant by the Canadian Internet Registration Authority through its Community Investment Program.