Compensation: $80k–$120k
Location options: Paid relocation
Job type: Contract
Experience level: Mid-Level, Senior
Role: Database Administrator
Industry: Healthcare Staffing, Technology Staffing
Company size: 51–200 people
Company type: Private
sql, amazon-web-services, cloud, python, apache-spark
We’re looking for a Data Engineer to help us transform our data systems and architecture to support greater variety, volume, and velocity of data and data sources. You might be a good fit if:
- You enjoy extracting data from a variety of sources and finding ways to connect them and make them suitable for use in software systems and for the development of models and algorithms.
- You enjoy interacting with new database systems and learning new data technologies, and you are interested in developing your knowledge of new tools and techniques.
- You are interested in automating data engineering efforts to minimize human intervention and optimize data quality.
- You have an interest in developing your knowledge of practical data science techniques and technologies in addition to your data engineering knowledge and experience.
Note: This role requires comprehensive data engineering skills. It is not a SQL developer role, although SQL is a required skill.
What you'll do:
We’re looking for an experienced data engineer to help us:
- Build and maintain serverless data ingestion and refresh pipelines at terabyte scale using AWS cloud services, including AWS Glue, Amazon Redshift, Amazon S3, Amazon Athena, Amazon DynamoDB, and others
- Incorporate new data sources from external vendors using flat files, APIs, web-scraping, and databases.
- Maintain and provide support for the existing data pipelines using Python, Glue, Spark, and SQL
- Develop and enhance the database architecture of the new analytic data environment, including recommending optimal choices among relational, columnar, and document databases based on requirements
- Identify and deploy appropriate file formats for data ingestion into various storage and/or compute services via Glue for multiple use cases
- Develop real-time or near-real-time ingestion of web and web service logs from Splunk
- Maintain existing processes and develop new methods to match external data sources to our data using exact and fuzzy methods
- Implement and use machine-learning-based data wrangling tools such as Trifacta to cleanse and reshape third-party data so that it is suitable for use.
- Develop and implement tests to ensure data quality across all integrated data sources.
- Serve as an internal subject matter expert and coach, training team members in the use of distributed computing frameworks for data analysis and modeling, including AWS services and Apache projects
Requirements:
- Master’s degree in Computer Science or Engineering, or equivalent work experience
- Two to four years’ experience working with datasets with hundreds of millions of rows using a variety of technologies
- Intermediate to expert level programming experience in Python and SQL in Windows and macOS/Linux environments
- Intermediate level experience working with distributed computing frameworks, especially Spark
- Intermediate level experience working with relational databases including PostgreSQL and Microsoft SQL Server
- Experience working with contemporary data file formats such as Apache Parquet and Avro, and with columnar databases such as Redshift
- Experience working with distributed SQL query engines such as Presto and Athena
- Experience with Amazon Web Services including Redshift, S3, Kinesis, Glue, and DynamoDB
- Experience analyzing data for data quality and supporting the use of data in an enterprise setting.
Nice to have:
- Some experience working with clustering and classification models
- Some experience working with Trifacta
- Some experience working with Google Analytics
- Some familiarity with RDF and SPARQL, and some experience working with graph databases
- Experience with enterprise search engine systems, including Elasticsearch and Apache Solr