GSoC 2017 Final Evaluation

Project overview

My aim for this Summer of Code was to add support to dateparser for searching for and parsing date expressions in large chunks of text. The work results in a function that detects the language of a string, finds all substrings that represent dates, parses them, and returns a list of tuples, each pairing a substring with the corresponding datetime.datetime object.
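To make the shape of the result concrete, here is a toy sketch that returns the same kind of output: a list of (substring, datetime.datetime) tuples. It is not the dateparser code. The name find_iso_dates and the restriction to YYYY-MM-DD dates are assumptions made purely for illustration, and no language detection or translation happens here.

```python
import re
from datetime import datetime

# Hypothetical pattern: only ISO-like dates, unlike the real multi-language search.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_iso_dates(text):
    """Return a list of (substring, datetime) tuples for ISO dates in text."""
    results = []
    for match in ISO_DATE.finditer(text):
        substring = match.group(0)
        try:
            parsed = datetime.strptime(substring, "%Y-%m-%d")
        except ValueError:
            # Looks like a date but is not a valid one (e.g. month 13): skip it.
            continue
        results.append((substring, parsed))
    return results


found = find_iso_dates("Released on 2017-08-29, updated 2017-09-01.")
for substring, parsed in found:
    print(substring, parsed)
```

Keeping the original substring next to the parsed object lets the caller locate each date in the source text again.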

The work had the following stages:
  • Implementing search based on translating the string using words from the language dictionary.
  • Figuring out how to properly split dates from one another. In my approach, sentence boundaries and words that aren’t present in the dictionary are treated as splitters. All the language files were updated with information on the sentence splitters used in each language. Dates that still aren’t separated from one another are then split on each character from a list (whitespace, comma and similar symbols) using several different methods (every splitter, every second splitter, etc.). The best of the possible splits is chosen by the number of resulting valid date expressions.
  • Implementing another language detector that is able to work with large chunks of text. It uses the count of words from the language dictionary as well as the presence of characters gathered from that dictionary. If a new language is added to dateparser, it should automatically become available to the detector.
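The "choose the best split" step from the second stage can be sketched roughly as follows. This is a simplified illustration, not the project's code: the names looks_like_date, split_every and best_split are hypothetical, the validator only accepts English "day month year" strings, and tokens are rejoined with a single space rather than with their original splitter.

```python
import re

# Assumed splitter characters: whitespace, comma, semicolon.
SPLITTERS = re.compile(r"[\s,;]+")

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Toy validator: "10 May 2017" style only (stands in for a real date parser).
DATE_PATTERN = re.compile(r"^\d{1,2} (?:%s) \d{4}$" % MONTHS)


def looks_like_date(candidate):
    return DATE_PATTERN.match(candidate) is not None


def split_every(tokens, n):
    """Group tokens n at a time, i.e. split on every n-th splitter."""
    return [" ".join(tokens[i:i + n]) for i in range(0, len(tokens), n)]


def best_split(chunk, max_group=4):
    """Try several groupings and keep the one with the most valid dates."""
    tokens = [t for t in SPLITTERS.split(chunk) if t]
    best, best_score = [], -1
    for n in range(1, max_group + 1):
        valid = [c for c in split_every(tokens, n) if looks_like_date(c)]
        # Strict comparison: on a tie, the finer-grained split is kept.
        if len(valid) > best_score:
            best, best_score = valid, len(valid)
    return best


print(best_split("10 May 2017, 12 June 2018"))
```

In this example only the grouping on every third splitter yields valid expressions, so it wins over splitting on every splitter or every second one.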
All new code is covered by tests. The resulting function and the class it uses are documented with docstrings.
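The idea behind the language detector from the third stage can be sketched like this. The dictionaries below are tiny hypothetical stand-ins for dateparser's language files, and the way the two signals are combined (word count first, character count as a tie-breaker) is my assumption for the sketch, not necessarily what the real detector does.

```python
import re

# Hypothetical miniature "language dictionaries": a few date words and an alphabet.
LANGUAGE_DATA = {
    "en": {
        "words": {"may", "june", "monday", "yesterday", "ago"},
        "chars": set("abcdefghijklmnopqrstuvwxyz"),
    },
    "ru": {
        "words": {"мая", "июня", "понедельник", "вчера", "назад"},
        "chars": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
    },
}


def detect_language(text):
    """Return the best-scoring language code, or None if nothing matches."""
    lowered = text.lower()
    tokens = re.findall(r"\w+", lowered)
    best_lang, best_score = None, (0, 0)
    for lang, data in LANGUAGE_DATA.items():
        word_hits = sum(1 for token in tokens if token in data["words"])
        char_hits = sum(1 for ch in lowered if ch in data["chars"])
        # Assumed scoring: dictionary words dominate, characters break ties.
        score = (word_hits, char_hits)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


print(detect_language("Встреча состоится 10 мая, а не вчера"))
print(detect_language("The meeting is on 10 May, not yesterday"))
```

Because the loop simply walks over whatever is in LANGUAGE_DATA, adding an entry for a new language makes it available to the detector without further changes, which mirrors the behaviour described in the list above.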

Future work

Before it is finally merged, this project’s code needs to be reconciled with the other GSoC 2017 project for dateparser, Integration of the Unicode CLDR database. Both projects changed dateparser’s own files, not to mention adding a large amount of new code, so making it all work together will take some time, but I think the result will be awesome.

Links to the code

All the work that was done can be found in the pull requests that I made in the dateparser repository on GitHub:
