Open Skills Project
The Open Skills Project provides a dynamic, up-to-date, locally relevant, and normalized taxonomy of skills and jobs that builds on and extends the Department of Labor’s O*NET data resources. The taxonomy is aimed at two groups:
- Developers, by way of the Open Skills API
- Researchers, using tabular data sets (coming soon!)
Collecting Data
Information used to create the Open Skills API comes from a variety of sources:
- Public and private providers of job listings
- O*NET jobs and skills taxonomy (https://www.onetonline.org/)
The job listings data sources are converted into a common job listing format, based on the schema.org Job Posting Schema (https://schema.org/JobPosting), and saved as JSON into an S3 folder according to the quarter(s) in which they are active.
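As a rough illustration of this step, the sketch below maps a raw listing onto a few schema.org JobPosting properties and derives one storage key per quarter in which the posting is active. The raw field names (`job_title`, `body`, `posted`, `expires`, `city`, `state`, `id`), the `job_postings` prefix, and the `2016Q1`-style quarter labels are assumptions made for the example, not the project's actual layout, and the upload to S3 itself is left out.

```python
import json
from datetime import date


def quarters_active(start: date, end: date) -> list:
    """Return labels such as '2016Q1' for every quarter overlapping [start, end]."""
    if end < start:
        raise ValueError("end date precedes start date")
    year, quarter = start.year, (start.month - 1) // 3 + 1
    last = (end.year, (end.month - 1) // 3 + 1)
    labels = []
    while (year, quarter) <= last:
        labels.append(f"{year}Q{quarter}")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return labels


def to_common_format(raw: dict) -> dict:
    """Map a hypothetical raw listing onto schema.org JobPosting properties."""
    return {
        "@context": "https://schema.org",
        "@type": "JobPosting",
        "title": raw["job_title"],
        "description": raw["body"],
        "datePosted": raw["posted"].isoformat(),
        "validThrough": raw["expires"].isoformat(),
        "jobLocation": {
            "@type": "Place",
            "address": {
                "@type": "PostalAddress",
                "addressLocality": raw["city"],
                "addressRegion": raw["state"],
            },
        },
    }


def storage_objects(raw: dict, prefix: str = "job_postings") -> dict:
    """Return {key: JSON body}, one entry per quarter the posting is active."""
    body = json.dumps(to_common_format(raw), sort_keys=True)
    return {
        f"{prefix}/{label}/{raw['id']}.json": body
        for label in quarters_active(raw["posted"], raw["expires"])
    }
```

A posting active from mid-March to early July would be written under three quarter folders here, which is one reading of "the quarter(s) in which they are active".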
Processing
O*NET taxonomy data is transformed into master tables of jobs and skills, along with a table of associations between them. Job posting titles are cleaned and aggregated into geographical counts. The titles and descriptions are also indexed into Elasticsearch to implement a rudimentary job title normalizer.
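The cleaning and counting step might look something like the following minimal sketch. The cleaning rules (lowercasing, dropping everything except letters, collapsing whitespace) and the `title` and `region` field names are assumptions chosen for illustration; the project's actual rules and geography definitions may differ, and the Elasticsearch-based normalizer is not covered here.

```python
import re
from collections import Counter


def clean_title(title: str) -> str:
    """Lowercase a title, replace non-letters with spaces, and collapse whitespace."""
    letters_only = re.sub(r"[^a-z\s]", " ", title.lower())
    return " ".join(letters_only.split())


def geographic_counts(postings) -> Counter:
    """Count postings per (region, cleaned title) pair."""
    counts = Counter()
    for posting in postings:
        counts[(posting["region"], clean_title(posting["title"]))] += 1
    return counts
```

With this approach, "Registered Nurse" and "REGISTERED NURSE!!" posted in the same region fall into the same bucket, so the count reflects the job rather than the spelling of its title.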
Output
A tabular version of each processed data set is uploaded to a publicly accessible S3 bucket for use by researchers. The processed data is also loaded into a relational database, which the Open Skills API queries to retrieve data in response to user requests.
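To make the two outputs concrete, here is a small sketch that renders a table as CSV text and loads the same rows into a relational database that can then be queried. SQLite stands in for whatever database the project uses, and the table names, column names, and placeholder rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Placeholder rows, not real O*NET data.
JOBS = [("job-1", "Example Job A"), ("job-2", "Example Job B")]
SKILLS = [("skill-1", "Example Skill X"), ("skill-2", "Example Skill Y")]
JOBS_SKILLS = [("job-1", "skill-1"), ("job-1", "skill-2"), ("job-2", "skill-2")]


def to_csv(header, rows) -> str:
    """Render a table as CSV text, the kind of file a researcher could download."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def load_database(jobs, skills, jobs_skills) -> sqlite3.Connection:
    """Load the master tables and their association table into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobs (id TEXT PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE skills (id TEXT PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE jobs_skills (
            job_id TEXT NOT NULL REFERENCES jobs(id),
            skill_id TEXT NOT NULL REFERENCES skills(id),
            PRIMARY KEY (job_id, skill_id)
        );
        """
    )
    conn.executemany("INSERT INTO jobs VALUES (?, ?)", jobs)
    conn.executemany("INSERT INTO skills VALUES (?, ?)", skills)
    conn.executemany("INSERT INTO jobs_skills VALUES (?, ?)", jobs_skills)
    conn.commit()
    return conn


def skills_for_job(conn: sqlite3.Connection, job_id: str) -> list:
    """Return the skill names associated with one job, sorted alphabetically."""
    rows = conn.execute(
        "SELECT s.name FROM skills AS s "
        "JOIN jobs_skills AS js ON js.skill_id = s.id "
        "WHERE js.job_id = ? ORDER BY s.name",
        (job_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

A lookup like `skills_for_job` is the sort of query an API endpoint could run in response to a user request.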
Code
To produce this output, a variety of extraction and processing tasks are used across four different code repositories.
- skills-utils contains common utilities (for example, hashing, Elasticsearch, and S3 helpers) shared by different parts of the Open Skills Project.
- skills-ml contains processing algorithms and integrations with various open datasets to help compute our jobs and skills taxonomy.
- skills-airflow contains an orchestration workflow, built on the Airflow project, that combines tasks from skills-public-etl and skills-ml to create aggregated data suitable for public consumption.
- skills-api contains a Flask application that runs the Open Skills API, making available data generated by the skills-airflow repository for developer use.